1. Introduction & Background
Predictive modeling in education, particularly Knowledge Tracing (KT), aims to model a student's evolving knowledge state to forecast future performance and personalize instruction. Traditional methods relying on human interpretation of performance data are prone to cognitive biases (e.g., positivity bias, memory limitations). Computational KT, introduced by Corbett and Anderson (1994), mitigates these biases by inferring knowledge states directly from student interaction data.
While most research prioritizes model accuracy, this paper shifts focus to a critical yet underexplored dimension: algorithmic fairness. Fairness ensures models do not systematically disadvantage groups based on sensitive attributes (e.g., device type, country of origin). In the context of Second Language Acquisition (SLA) via platforms like Duolingo, bias could perpetuate educational inequity.
Core Research Questions: This study evaluates the fairness of KT models across: 1) Different client platforms (iOS, Android, Web), and 2) Learners from developed versus developing countries.
2. Methodology & Experimental Setup
The study employs a comparative analysis framework to evaluate both the predictive performance and fairness of models.
2.1 Datasets: Duolingo Tracks
Three distinct learning tracks from the 2018 Duolingo Shared Task on Second Language Acquisition were used:
- en_es: English speakers learning Spanish.
- es_en: Spanish speakers learning English.
- fr_en: French speakers learning English.
2.2 Predictive Models Evaluated
The study compares two broad classes of models:
- Machine Learning (ML) Models: likely traditional approaches such as Logistic Regression, Random Forests, or Bayesian Knowledge Tracing (BKT); a minimal sketch of a feature-based baseline follows this list.
- Deep Learning (DL) Models: likely sequence models such as Long Short-Term Memory (LSTM) networks or Deep Knowledge Tracing (DKT), which are adept at capturing temporal dependencies in learning sequences.
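The paper does not enumerate the exact architectures behind these classes, so the following is a minimal sketch assuming a feature-based Logistic Regression baseline over hand-engineered interaction features; the feature names and synthetic labels are illustrative, not the study's actual pipeline. A sequence-model counterpart is sketched in Section 4.1.

```python
# Hypothetical feature-based ML baseline for next-answer correctness.
# Feature names are illustrative; the paper does not list its exact features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5000

# Engineered per-interaction features (e.g., learner's running accuracy,
# days since enrollment, an item-difficulty proxy, position within session).
X = np.column_stack([
    rng.uniform(0, 1, n),
    rng.exponential(30, n),
    rng.normal(0, 1, n),
    rng.integers(1, 50, n),
])
y = rng.integers(0, 2, n)  # 1 = next answer correct (synthetic labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```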
2.3 Fairness Metrics & Evaluation Framework
Fairness was assessed using group fairness metrics. For a binary prediction (e.g., will the student answer the next item correctly?), common metrics include:
- Demographic Parity: Equal prediction rates across groups.
- Equal Opportunity: Equal true positive rates across groups.
- Predictive Parity: Equal precision across groups.
3. Experimental Results & Findings
The analysis yielded four key findings, highlighting trade-offs between accuracy and fairness.
Key Findings at a Glance
- DL Superiority: DL models generally outperformed ML in both accuracy and fairness.
- Mobile Bias: Both ML and DL showed bias favoring mobile (iOS/Android) over web users.
- Development Bias: ML models exhibited stronger bias against learners from developing countries than DL models.
- Context-Dependent Choice: Optimal model choice (DL vs. ML) depends on the specific learning track.
3.1 Performance: Accuracy Comparison
Deep Learning models demonstrated a marked advantage in predictive accuracy across the evaluated tracks. This aligns with the established capability of neural sequence models like DKT to model complex, non-linear learning trajectories more effectively than simpler ML models, as noted in the seminal DKT paper by Piech et al.
3.2 Fairness Across Client Platforms
A consistent and noticeable bias was observed favoring mobile app users (iOS, Android) over web browser users. This could stem from:
- Data quality differences (e.g., interaction patterns, session lengths).
- Unintended correlations between platform choice and learner engagement or socioeconomic factors that are baked into the training data.
3.3 Fairness Across Country Development Levels
Machine Learning algorithms showed a more pronounced bias against learners from developing countries compared to Deep Learning algorithms. This suggests that DL models, with their greater capacity, might be learning more robust, generalizable patterns that are less sensitive to spurious correlations linked to development status.
3.4 Trade-off Analysis: Accuracy vs. Fairness
The study recommends a nuanced, context-specific approach:
- For the en_es and es_en tracks, Deep Learning is the more suitable choice, offering a better balance of accuracy and fairness.
- For the fr_en track, Machine Learning emerged as a more suitable option, potentially due to dataset characteristics where simpler models generalize more fairly.
4. Technical Deep Dive
4.1 Knowledge Tracing Formalism
At its core, KT models a learner's knowledge state as a latent variable that evolves over time. Given a sequence of learner interactions (e.g., exercise attempts) $X = \{x_1, x_2, ..., x_t\}$, the goal is to predict the probability of correctness on the next item, $P(r_{t+1} = 1 | X)$.
Deep Knowledge Tracing (DKT) uses a Recurrent Neural Network (RNN) to model this:
$h_t = \text{RNN}(x_t, h_{t-1})$
$P(r_{t+1} = 1 \mid X) = \sigma(W \cdot h_t + b)$
where $h_t$ is the hidden state representing the knowledge state at time $t$, and $\sigma$ is the sigmoid function.
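A minimal PyTorch sketch of this formulation is given below. It simplifies the original DKT (which outputs a per-skill probability vector) to a single correctness probability per step; the interaction encoding, dimensions, and choice of an LSTM cell are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal DKT-style model: an RNN over encoded interactions with a sigmoid head.
# Simplification: predicts a single P(correct) per step instead of DKT's per-skill vector.
import torch
import torch.nn as nn

class SimpleDKT(nn.Module):
    def __init__(self, n_items: int, hidden: int = 64):
        super().__init__()
        # Encode (item, correctness) pairs as 2 * n_items discrete tokens, as in DKT.
        self.embed = nn.Embedding(2 * n_items, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)  # h_t = RNN(x_t, h_{t-1})
        self.out = nn.Linear(hidden, 1)                        # W . h_t + b

    def forward(self, interactions):           # (batch, time) integer-encoded interactions
        h, _ = self.rnn(self.embed(interactions))
        return torch.sigmoid(self.out(h)).squeeze(-1)          # P(r_{t+1} = 1 | X) at each step

# One training step on synthetic data (binary cross-entropy against next-step correctness).
model = SimpleDKT(n_items=300)
x = torch.randint(0, 600, (8, 25))             # 8 learners, 25 interactions each
target = torch.randint(0, 2, (8, 25)).float()
loss = nn.functional.binary_cross_entropy(model(x), target)
loss.backward()
```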
4.2 Fairness Metrics Formulation
Let $A \in \{0,1\}$ be a sensitive attribute (e.g., $A=1$ for a mobile user, $A=0$ for a web user), let $Y$ be the true outcome, and let $\hat{Y}$ be the model's prediction. Demographic Parity requires:
$P(\hat{Y}=1 | A=1) = P(\hat{Y}=1 | A=0)$
Equal Opportunity (considering correctness as the positive outcome) requires:
$P(\hat{Y}=1 | A=1, Y=1) = P(\hat{Y}=1 | A=0, Y=1)$
The bias observed in the study can be quantified as the difference or ratio between these conditional probabilities for different groups.
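A short sketch of how these gaps might be computed from model outputs follows, covering the three group metrics listed in Section 2.3 plus the difference/ratio quantification described above. The arrays are synthetic placeholders standing in for the model's binarized predictions, the true outcomes, and the sensitive attribute.

```python
# Demographic parity, equal opportunity, and predictive parity gaps between two groups.
# Inputs are synthetic; in practice y_pred comes from the KT model and a is the
# sensitive attribute (e.g., 1 = mobile user, 0 = web user).
import numpy as np

def fairness_gaps(y_true, y_pred, a):
    y_true, y_pred, a = map(np.asarray, (y_true, y_pred, a))
    rates = {}
    for g in (0, 1):
        m = a == g
        rates[g] = {
            "pred_pos": y_pred[m].mean(),                      # P(Y_hat=1 | A=g)
            "tpr": y_pred[m & (y_true == 1)].mean(),           # P(Y_hat=1 | A=g, Y=1)
            "precision": y_true[m & (y_pred == 1)].mean(),     # P(Y=1 | A=g, Y_hat=1)
        }
    return {
        "demographic_parity_diff": rates[1]["pred_pos"] - rates[0]["pred_pos"],
        "demographic_parity_ratio": rates[0]["pred_pos"] / rates[1]["pred_pos"],
        "equal_opportunity_diff": rates[1]["tpr"] - rates[0]["tpr"],
        "predictive_parity_diff": rates[1]["precision"] - rates[0]["precision"],
    }

rng = np.random.default_rng(7)
y_true = rng.integers(0, 2, 10_000)
a = rng.integers(0, 2, 10_000)
y_pred = (rng.uniform(size=10_000) < 0.55 + 0.05 * a).astype(int)  # mild bias toward A=1
print(fairness_gaps(y_true, y_pred, a))
```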
5. Analysis Framework & Case Example
Framework for Auditing KT Fairness: Edtech developers can adopt this structured approach:
- Disaggregate Evaluation: Never report only aggregate accuracy. Always calculate performance metrics (accuracy, AUC) and fairness metrics (demographic parity difference, equal opportunity difference) separately for each sensitive subgroup (by platform, country, or gender if available); a sketch of such an audit follows this list.
- Root Cause Analysis: For identified biases, investigate feature correlations. Is "number of sessions" correlated with both platform and prediction outcome? Could proxy variables for socioeconomic status be leaking into the model via behavioral data?
- Mitigation Strategy Selection: Based on the cause, choose a mitigation technique: pre-processing (re-weighting data), in-processing (adding fairness constraints to the loss function, as in approaches from the FAccT, formerly FAT*, research community), or post-processing (calibrating thresholds per group).
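As a concrete starting point for the first step, here is a hedged sketch of a disaggregated audit over a table of model predictions; the column names (`client`, `correct`, `p_correct`) are hypothetical and would need to match the actual log schema.

```python
# Disaggregated evaluation: performance and fairness metrics per sensitive subgroup.
# Column names are hypothetical; each subgroup must contain both outcome classes for AUC.
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

def audit(df: pd.DataFrame, group_col: str, label="correct", score="p_correct", thresh=0.5):
    rows = []
    for group, sub in df.groupby(group_col):
        pred = (sub[score] >= thresh).astype(int)
        rows.append({
            group_col: group,
            "n": len(sub),
            "accuracy": accuracy_score(sub[label], pred),
            "auc": roc_auc_score(sub[label], sub[score]),
            "positive_rate": pred.mean(),                 # for demographic parity
            "tpr": pred[sub[label] == 1].mean(),          # for equal opportunity
        })
    report = pd.DataFrame(rows)
    report["dp_gap"] = report["positive_rate"].max() - report["positive_rate"]
    report["eo_gap"] = report["tpr"].max() - report["tpr"]
    return report

# Usage: audit(predictions, "client") and audit(predictions, "country_group")
```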
Case Example - The Mobile Bias: Imagine an LSTM-based KT model trained on Duolingo data shows a 15% higher predicted probability of success for iOS users vs. Web users, holding actual performance constant. Our audit reveals the "time-of-day" feature is a key driver: iOS users practice more in short, frequent bursts (commutes), while Web users have longer, less frequent sessions. The model associates the "commute pattern" with higher engagement and boosts predictions, unfairly penalizing Web users who may learn effectively in different patterns. Mitigation: We could apply a fairness-aware regularization term during training that penalizes the model for differences in prediction distributions between the platform groups, guided by the work of researchers like Zemel et al. on learning fair representations.
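A hedged sketch of the in-processing option mentioned above: a demographic-parity-style penalty on the gap in mean predicted probability between platform groups, added to the usual cross-entropy loss. This is one plausible instantiation for illustration, not the paper's method and not the representation-learning approach of Zemel et al.

```python
# In-processing mitigation sketch: penalize the gap in mean predicted probability
# between platform groups. One plausible regularizer, not the paper's exact method.
import torch
import torch.nn.functional as F

def fair_kt_loss(p_correct, labels, group, lam=1.0):
    """p_correct: predicted probabilities; labels: 0/1 correctness; group: 0 = web, 1 = mobile."""
    bce = F.binary_cross_entropy(p_correct, labels)
    gap = (p_correct[group == 1].mean() - p_correct[group == 0].mean()).abs()
    return bce + lam * gap

# Example with dummy tensors (in training, p_correct would come from the KT model).
p = torch.rand(256, requires_grad=True)
y = torch.randint(0, 2, (256,)).float()
g = torch.randint(0, 2, (256,))
fair_kt_loss(p, y, g, lam=0.5).backward()
```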
6. Critical Analysis & Expert Interpretation
Core Insight: This paper delivers a crucial, uncomfortable truth for the booming EdTech sector: your state-of-the-art knowledge tracing models are likely baking in systemic biases that favor affluent, mobile-first users and developed nations. The pursuit of accuracy has blinded the field to the ethical debt accumulating in its algorithms. The finding that bias persists even in sophisticated Deep Learning models is a sobering counterpoint to the belief that more complex models inherently learn "fairer" representations.
Logical Flow: The authors logically progress from establishing the KT paradigm to exposing its fairness blind spot. Using the well-established Duolingo dataset provides credibility and reproducibility. The bifurcated analysis—platform bias and geopolitical bias—cleverly captures two major axes of digital divide. The comparison between classical ML and modern DL is not just technical but strategic, helping practitioners choose tools with ethical implications in mind.
Strengths & Flaws: The primary strength is its actionable, empirical focus on real-world data and clear, comparative findings. It moves beyond theoretical fairness discussions. However, a significant flaw is the lack of mechanistic explanation. Why does mobile bias occur? Is it data artifact, user behavior difference, or model limitation? The paper diagnoses the disease but offers scant pathology. Furthermore, the suggestion to use ML for the `fr_en` track based on fairness, despite its lower accuracy, presents a real-world dilemma: how much accuracy are we willing to sacrifice for fairness, and who decides?
Actionable Insights: For product leaders and engineers, this study is a mandate for change. First, fairness auditing must become a standard KPI alongside A/B testing for new model deployments, akin to practices advocated by Google's PAIR initiative. Second, the observed biases suggest a need for platform-specific feature engineering or calibration. Perhaps web users require a subtly different predictive model. Third, the research underscores the need for more diverse and representative training data. Collaborations with NGOs or educational bodies in developing regions could help rebalance datasets. Finally, the field must develop and adopt "Fairness-by-Design" KT architectures, integrating constraints from the outset, rather than retrofitting fairness as an afterthought.
7. Future Applications & Research Directions
- Personalized Fairness-Aware Tutoring: Future intelligent tutoring systems (ITS) can dynamically adjust not just for knowledge state, but also to counteract predicted biases. If the system detects a student is from an underrepresented group for which the model is less confident, it could provide more supportive scaffolding or gather more data to reduce uncertainty fairly.
- Cross-Cultural & Cross-Linguistic Model Transfer: Research should explore fairness in transfer learning. Is a KT model trained on English-speaking learners fair when fine-tuned for Spanish speakers? Techniques from domain adaptation could be merged with fairness constraints.
- Explainable Fairness (XFairness): Beyond measuring bias, we need tools to explain which features contribute to unfair outcomes. This aligns with the broader XAI (Explainable AI) movement and is critical for developer trust and effective mitigation.
- Longitudinal Fairness Studies: Does algorithmic bias increase or decrease over a learner's multi-year journey? Longitudinal studies are needed to understand the compounding effects of biased feedback loops in adaptive systems.
- Integration with Learning Science: Future work must bridge the gap with pedagogical theory. What does "fairness" mean from a cognitive load or motivational perspective? Fairness should align with educational equity principles, not just statistical parity.
8. References
- Corbett, A. T., & Anderson, J. R. (1994). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4), 253-278.
- Piech, C., Bassen, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L. J., & Sohl-Dickstein, J. (2015). Deep knowledge tracing. Advances in Neural Information Processing Systems, 28.
- Zemel, R., Wu, Y., Swersky, K., Pitassi, T., & Dwork, C. (2013). Learning fair representations. International Conference on Machine Learning (pp. 325-333). PMLR.
- Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., & Galstyan, A. (2021). A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR), 54(6), 1-35.
- Google PAIR. (n.d.). People + AI Guidebook. Retrieved from https://pair.withgoogle.com/
- Settles, B., Brust, C., Gustafson, E., Hagiwara, M., & Madnani, N. (2018). Second language acquisition modeling. Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications (BEA), NAACL-HLT 2018.
- Barocas, S., Hardt, M., & Narayanan, A. (2019). Fairness and Machine Learning: Limitations and Opportunities. fairmlbook.org.