bupaverse framework for data handling and visualization. We finish the chapter with a reflection on the method and its reliability and applicability.
1 Introduction
Nowadays, almost all learning platforms generate vast amounts of data that include every interaction a student has with the learning environment. Such large amounts of data offer a unique opportunity to analyze and understand the dynamics of the learning process. In the previous chapters of the book, we covered several methods for analyzing the temporal aspects of learning, such as sequence analysis [1, 2], Markovian modeling [3] or temporal network analysis [4]. In this chapter, we present an analytical method that is specifically oriented toward analyzing time-stamped event log data: process mining. Process mining is a technique that allows us to discover, visualize, and analyze the underlying process from time-stamped event logs. Through process mining, we may uncover hidden patterns, bottlenecks, and inefficiencies in students’ learning journeys. By tracking students’ actions step by step, we can identify which resources are most effective, which topics are more challenging, and even anticipate possible problems before they occur.
Process mining emerged as a business tool that allows organizations to analyze and improve their operational processes. The field has rapidly expanded with several modeling methods, algorithms, tools and visualization techniques. Further, the method has been met with enthusiasm by researchers, leading to a rapid uptake in other disciplines such as health care management and education. As the field currently stands, it is a blend of process management and data science, with less emphasis on statistical methods. The field has found its place in educational research with the recent surge of trace log data generated by students’ activities and the interest that learning analytics and educational data mining have kindled.
This tutorial chapter will introduce the reader to the fundamental concepts of the process mining technique and its applications in learning analytics and education research at large. We will first describe the method, the main terminology, and the common steps of analysis. Then, we will provide a review of related literature to gain an understanding of how this method has been applied in learning analytics research. We then provide a step-by-step tutorial on process mining using the R programming language and the bupaverse framework. In this tutorial, we analyze a case study of students’ online activities in an online learning management system using the main features of process mining.
2 Basic steps in process mining
The goal of process mining is to extract process models from event data. The resulting models can then be used to portray students’ pathways in learning, identify common transitions, and detect issues in their approach. In doing so, process mining promises to find deviations from the norm, suggest corrective actions, and, as an ultimate goal, optimize processes [5]. Process mining starts with the extraction of event data. In the case of educational process mining, event data often reflect students’ activities in learning management systems (LMSs), or in other types of digital tools that record time-stamped events of students’ interactions, such as automated assessment tools or online learning games. These data are used to construct what is known as an event log. An event log has three necessary parts:
Case identifier: A case represents the subject of the process. For example, if we are analyzing students’ enrollment process, each student would represent a different case. If we are analyzing students’ online activities in the LMS, we can also consider each student as a separate case; alternatively, if we want a greater level of granularity, each of a student’s learning sessions can be considered a separate case. All event logs need to have a case identifier that unequivocally identifies each case and allows us to group together all the events that belong to the same case.
Activity identifier: Activities represent each action or event in the event data. Continuing with the previous examples, an activity would represent each step in the enrollment process (e.g., application, revision, acceptance, payment, etc.), or each action in the LMS (e.g., watch video, read instructions, or check calendar).
Timestamp: The timestamp is a record of the moment each event took place. It allows us to establish the order of the events. In the case of online activity data, for instance, the timestamp would be the instant in which a student clicks on a learning resource. On some occasions, activities are not instantaneous, but rather have a beginning and an end. For example, if we are analyzing students’ video watching, we might have an event record both when they start watching and when they finish. If we want to treat these events as parts of the same activity, we need to provide additional information when constructing the event log. Specifically, we need an activity instance identifier, which allows us to unequivocally group together all events that belong to the same instance of an activity (watching a given video, in our example), and a lifecycle identifier (e.g., start and end) to differentiate between the stages of that activity. A common limitation of LMS data is that only one click is recorded per activity, so this type of analysis is often not possible. The sketch below shows how these columns map onto an event log object in R.
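To make the mapping concrete, the following is a minimal sketch of how these identifiers translate into a bupaR event log object. The toy dataframe and its column names (student, action, time) are hypothetical; for instantaneous events, simple_eventlog() fills in the activity instance and lifecycle columns automatically, whereas eventlog() lets us specify activity_instance_id and lifecycle_id explicitly for activities with a start and an end.

library(bupaverse)

# Hypothetical toy data: one row per recorded (instantaneous) event
toy <- data.frame(
  student = c("Layla", "Layla", "Layla", "Layla"),
  action  = c("Calendar", "Lecture", "Video", "Assignment"),
  time    = as.POSIXct("2024-01-10 09:00:00") + c(0, 300, 900, 1500)
)

# Case identifier, activity identifier and timestamp map directly onto
# the three main arguments of simple_eventlog()
log <- simple_eventlog(toy,
                       case_id     = "student",
                       activity_id = "action",
                       timestamp   = "time")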
Once we have defined our event log, we can calculate multiple metrics that help us understand the data. For example, we can see the most frequent activities and the most frequent transitions. We can also see the case coverage for each activity (i.e., how many cases contain each activity) and the distribution of the length of each case (i.e., how many activities it has). We can also calculate performance metrics, such as idle time (i.e., time spent without doing any activities) or throughput time (i.e., overall time taken).
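As an illustration, the following sketch computes several of these metrics with the edeaR functions bundled in bupaverse, assuming the log object created above; the function names are part of the package, but the chosen levels and units are just one possible configuration.

activity_frequency(log, level = "activity")           # how often each activity occurs
activity_presence(log)                                # case coverage per activity
trace_length(log, level = "case")                     # number of activities per case
throughput_time(log, level = "case", units = "mins")  # overall time taken per case
idle_time(log, level = "case", units = "mins")        # inactive time per case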
From the event log, we often construct what is known as the Directly-Follows Graph (DFG), in which we graphically represent all the activities in the event log as nodes and all the observed transitions between them as edges [5]. Figure 14.1 shows an example with event log data from students, where each case identifier represents a student’s session. First, we build the sequence of activities for each case. As such, Layla’s path for her first learning session would be: Calendar → Lecture → Video → Assignment. The path for Sophia’s first session would be: Calendar → Instructions → Video → Assignment. We put both paths together and construct a partial DFG that starts from Calendar, then transitions either to Lecture or Instructions, and then converges back into Video and ends in Assignment. We create the paths for the remaining student sessions. Then, we combine them together through an iterative process until we have the complete graph with all the possible transitions between activities. Our final graph with the four sessions shown in Figure 14.1 would start with Calendar, then transition either to Lecture or Instructions. Then the process could transition from Lecture to Instructions or vice versa, or to Video. In addition, Lecture has a self-loop because it can trigger another lecture (see Layla’s session 2). From Video, the only possible transition is to Assignment.
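Although bupaverse draws DFGs for us (as shown later in the tutorial), the underlying edge list is easy to compute by hand, which can help demystify the construction. The following is a minimal dplyr sketch, assuming a hypothetical dataframe events with columns session (case identifier), action (activity) and time (timestamp).

library(dplyr)

dfg_edges <- events |>
  group_by(session) |>
  arrange(time, .by_group = TRUE) |>
  mutate(next_action = lead(action)) |>  # the activity that directly follows
  ungroup() |>
  filter(!is.na(next_action)) |>         # drop the last event of each case
  count(action, next_action, name = "n_transitions")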
In real-life scenarios, the DFG of a complete (or large) event log may turn out to be overcrowded and hard to visualize or interpret [5]. Therefore, it is common to trim activities or transitions that do not occur often. Other options include splitting the event log by group (e.g., class sections) to reduce its granularity and be able to compare processes between groups. We can also zoom into specific parts of the course (e.g., a specific assignment or lecture) to better understand students’ activities at that time. Moreover, we can filter the event log to see cases that contain specific activities or that match certain conditions.
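The edeaR filters in bupaverse cover most of these operations. A few hedged examples follow, again assuming the log object from before; the activity name and the dates are hypothetical.

# Keep the most frequent activities (jointly covering 90% of all events)
log |> filter_activity_frequency(percentage = 0.9)

# Keep only cases that contain a given activity
log |> filter_activity_presence(activities = "Assignment")

# Zoom into a time window, keeping cases fully contained in it
log |> filter_time_period(interval = c(as.Date("2024-01-08"), as.Date("2024-01-14")),
                          filter_method = "contained")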
DFGs are often enhanced with labels that help us understand the graph better. These labels are often based on the frequency of activities and transitions. For example, the nodes (representing each activity) can be labeled with the number of times (or proportion) they appear in the event data, or with the case coverage, i.e., how many cases (or what percentage thereof) they appear in. The edges (representing transitions between activities) can be labeled with the frequency of the transitions or the case coverage as well, among others. Another common way of labeling the graph is using performance measures that indicate the mean (or median, maximum, etc.) time taken by each activity and/or transition.
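In bupaverse, these labeling choices correspond to the type argument of process_map() from processmapR. A brief sketch, assuming the same log object; the specific choices shown are illustrative.

# Absolute frequencies on nodes and edges
log |> process_map(type = frequency("absolute"))

# Case coverage: percentage of cases that contain each activity/transition
log |> process_map(type = frequency("relative-case"))

# Performance view: median time, in minutes
log |> process_map(type = performance(FUN = median, units = "mins"))

# Nodes and edges can also be labeled differently
log |> process_map(type_nodes = frequency("relative-case"),
                   type_edges = performance(FUN = mean, units = "mins"))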
The DFG gives us an overarching view of the whole event log. However, on some occasions, we would like to understand the underlying process that is common to most cases in our event log. This step is called process discovery [5], and there are several algorithms to perform it, such as the alpha algorithm [6], inductive techniques [7] or region-based approaches [8]. The discovered processes are then represented using specialized notation, such as Business Process Model and Notation (BPMN) [9] or Petri nets [10].
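As noted in the literature review below, discovery support within bupaverse itself is scarce; one option in R is the separate CRAN package heuristicsmineR, which implements a heuristics-based miner compatible with bupaR event logs. A hedged sketch follows; the threshold value is an arbitrary choice for illustration.

library(heuristicsmineR)

# Discover a causal net from the event log (the dependency threshold is an assumption)
cn <- causal_net(log, threshold = 0.8)
render_causal_net(cn)

# The causal net can be converted to a Petri net for standard notation
pn <- as.petrinet(cn)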
On some occasions, there are certain expectations regarding how the process should unfold. For instance, if we are analyzing students’ activities during a lesson, the teacher might have given a list of instructions that students are expected to follow in order. To make sure that the discovered process (i.e., what students’ data reveal) aligns with the expected process (i.e., what students were told to do, in our example), conformance checking is usually performed. Conformance checking deals with comparing the discovered process with an optimal model or theorized process (i.e., an ideal process) [5]. The idea is to find similarities and differences between the model process and the real observed process, identify unwanted behavior, detect outliers, etc. As seen from our example, in educational settings this can be used, for instance, to detect whether students are following the learning materials in the intended order or whether they implement the different phases of self-regulated learning. However, given that students are rarely asked to access learning materials in a strict sequential way, this feature has been rarely used. In the next section, we present a review of the literature on educational process mining where we discuss more examples.
3 Review of the literature
A review of the literature by Bogarín et al. [11] mapped the landscape of educational process mining and found a multitude of applications of this technique in a diversity of contexts. The most common application was to investigate the sequence of students’ activities in online environments such as MOOCs and other online or blended courses, as well as in computer-supported collaborative learning contexts [11]. One of the main aims was to detect learning difficulties in order to provide better support for students. For example, López-Pernas et al. [12] used process mining to explore how students transition between a learning management system and an automated assessment tool, and identified how struggling students make use of the resources to solve their problems. Arpasat et al. [13] used process mining to study students’ activities in a MOOC, and compared the behavior of high- and low-achieving students in terms of their activities, bottlenecks and time performance. A considerable volume of research has studied processes where the events are instantaneous, such as clicks in online learning management systems (e.g., [14–16]). Fewer studies have used activities with a start time and an end time, due to the limitations in data collection in online platforms. However, this limitation has often been overcome by grouping clicks into learning sessions, as is often done in the literature on students’ learning tactics and strategies, or self-regulated learning (e.g., [12, 17–19]).
Regarding the methods used, much of the existing research is limited to calculating performance metrics and visualizing DFGs, whereby researchers attempt to understand the most commonly performed activities and the common transitions between them. For example, Vartiainen et al. [20] used video-coded data of students’ participation in an educational escape room to visualize the transitions between in-game activities using DFGs. Oftentimes, researchers use DFGs to compare (visually) across groups, for example high vs. low achievers, or between clusters obtained through different methods. For instance, Saqr et al. [21] implemented k-means clustering to group students according to their online activity frequency, and used DFGs to understand the strategies adopted by the different types of learners and how they navigate their learning process. Using a different approach, Saqr and López-Pernas [22] clustered students into groups according to their sequences of interactions using distance-based clustering, and then compared the transitions between different interactions among the clusters using DFGs.
Going one step further, several studies have used process discovery to detect the underlying overall process behind the observed data [11]. A variety of algorithms have been used in the literature for this purpose, such as the alpha algorithm [23], the heuristic algorithm [24], or the fuzzy miner [18]. Less often, research on educational process mining has performed conformance checks [11], comparing the observed process with an “ideal” or “designed” one. An example is the work by Pechenizkiy et al. [25], who used conformance checking to verify whether students answered an exam’s questions in the order specified by the teacher.
When it comes to the tools used for process mining, researchers have relied on various point-and-click software tools [11]. For example, Disco [26] has been used for DFG visualization in several articles (e.g., [27]). ProM [28] is the dominant technology when it comes to process discovery (e.g., [19, 29]) and also conformance checking (e.g., [25]). Many articles have used the R programming language to conduct process mining, relying on the bupaverse [30] framework for basic metrics and visualization (the one covered in the present chapter), although not for process discovery (e.g., [12]), since the algorithm support is scarce.
4 Process mining with R
In this section, we present a step-by-step tutorial with R on how to conduct process mining of learners’ data. First, we will install and load the necessary libraries. Then, we will present the data that we will use to illustrate the process mining method.
4.1 The libraries
As a first step, we need two basic libraries that we have used multiple times throughout the book: rio for importing the data [31], and tidyverse for data manipulation [32]. As for the libraries used for process mining, we will first rely on bupaverse, a meta-package that contains many relevant libraries for this purpose (e.g., bupaR), which will help us with the frequentist approach [30]. We will use processanimateR to see a play-by-play animated representation of our event data. You can install the packages with the following commands:

install.packages("rio")
install.packages("tidyverse")
install.packages("bupaverse")
install.packages("processanimateR")
You can then load the packages using the library() function.
library(bupaverse)
library(tidyverse)
library(rio)
library(processanimateR)
4.2 Importing the data
The dataset that we are going to analyze with process mining contains logs of students’ online activities in an LMS during their participation in a course about learning analytics. We will also make use of students’ grade data to compare activities between high and low achievers. More information about the dataset can be found in the data chapter of this book [33]. In the following code chunk, we download students’ event and demographic data and we merge them together into the same dataframe (df).
# Event log data
df <- import("https://github.com/lamethods/data/raw/main/1_moodleLAcourse/Events.xlsx")
# Demographic data: keep the user identifier and achievement group
all <- import("https://github.com/lamethods/data/raw/main/1_moodleLAcourse/AllCombined.xlsx") |>
  select(User, AchievingGroup)
# Merge the events with the achievement groups
df <- df |> merge(all, by.x = "user", by.y = "User")
When analyzing students’ learning event data, we are often interested in analyzing each learning session separately, rather than considering a longer time span (e.g., a whole course). A learning session is a sequence (or episode) of uninterrupted learning events. To do such grouping, we define a threshold of inactivity, after which new activities are considered to belong to a new episode of learning or session. In the following code, we group students’ logs into learning sessions considering a threshold of 15 minutes (15 min × 60 s/min = 900 seconds), such that each session has its own session identifier (session_id). For a step-by-step explanation of the sessions, code and rationale, please refer to the sequence analysis chapter [1]. A preview of the resulting dataframe can be seen below. We see that each group of logs that are less than 900 seconds (15 minutes) apart (Time_gap column) are within the same session (new_session = FALSE) and thus share the same session_id. Logs that are more than 900 seconds apart are considered a new session (new_session = TRUE) and get a new session_id.
sessioned_data <- df |>
  group_by(user) |>
  arrange(user, timecreated) |>
  mutate(Time_gap = timecreated - lag(timecreated)) |>       # time since the previous event
  mutate(new_session = is.na(Time_gap) | Time_gap > 900) |>  # gap > 900 s starts a new session
  mutate(session_nr = cumsum(new_session)) |>                # running session number per user
  mutate(session_id = paste0(user, "_", "Session_", session_nr)) |>
  ungroup()
sessioned_data