Project MOSLA: A Multimodal, Longitudinal Dataset for Second Language Acquisition Research

Overview of Project MOSLA, a unique longitudinal, multimodal, and multilingual dataset capturing the complete second language acquisition process over two years.

1. Introduction

Second language acquisition (SLA) is a complex, dynamic process that has traditionally been studied through fragmented, unimodal, or short-term datasets. Project MOSLA (Moments of Second Language Acquisition) addresses these limitations by creating a pioneering longitudinal, multimodal, multilingual, and controlled dataset. The project documents learners acquiring Arabic, Spanish, or Chinese from scratch over two years via exclusive online instruction, recording every lesson. This dataset, comprising over 250 hours of video, audio, and screen recordings, paired with semi-automated annotations, provides an unprecedented resource for studying the nuanced trajectory of language learning.

2. Data Collection Methodology

The MOSLA dataset was constructed under a rigorous, controlled protocol to ensure consistency and research validity.

2.1 Participant Recruitment & Language Selection

Participants were recruited to learn one of three target languages: Arabic, Spanish, or Mandarin Chinese. The selection includes languages with non-Latin writing systems (Arabic script and Chinese characters), expanding the dataset's cross-linguistic applicability beyond commonly studied Indo-European languages.

2.2 Controlled Learning Environment

A key design feature is the controlled exposure mandate. Participants agreed to learn the target language only through the provided online lessons for the duration of the two-year study. This control minimizes confounding variables from external language exposure, allowing for clearer attribution of proficiency gains to the instructional method.

2.3 Multimodal Recording Setup

All lessons were conducted and recorded via Zoom, capturing three synchronized streams:

  • Video: Participant and instructor webcam feeds.
  • Audio: Full lesson audio.
  • Screen Share: The instructor's shared screen containing teaching materials, slides, and applications.

This triad creates a rich, contextualized record of the learning interaction.

Dataset at a Glance

  • Duration: ~2 years per participant
  • Total Recordings: >250 hours
  • Modalities: Video, Audio, Screen
  • Target Languages: 3 (Arabic, Spanish, Chinese)
  • Control: Exclusive online instruction

3. Data Annotation Pipeline

Raw recordings were processed through a semi-automated pipeline to generate structured, queryable metadata.

3.1 Semi-Automated Annotation Framework

Annotations were produced using a hybrid human-machine approach:

  1. Speaker Diarization: Segmenting audio into speaker-homogeneous regions ("who spoke when?").
  2. Speaker Identification: Labeling segments as 'instructor' or 'learner'.
  3. Language Identification: Tagging segments by language (e.g., L1/English vs. Target Language).
  4. Automatic Speech Recognition (ASR): Generating transcripts for all speech segments.

Initial annotations were created by human annotators, forming a gold-standard subset used to fine-tune state-of-the-art models.
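The four stages above compose naturally into a single pipeline. The sketch below is a minimal illustration, not the paper's implementation: `Segment` and the stage functions are hypothetical placeholders, and a production system would call real diarization, language-ID, and ASR models at each step.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """One speaker-homogeneous stretch of audio (times in seconds)."""
    start: float
    end: float
    speaker: str = ""    # 'instructor' or 'learner'
    language: str = ""   # e.g. 'en' or a target-language code
    transcript: str = ""

def diarize(audio):
    # Stage 1 (placeholder): split audio into speaker-homogeneous segments.
    # A real system would run a diarization model; we return fixed toy segments.
    return [Segment(0.0, 4.2), Segment(4.2, 7.5)]

def identify_speakers(segments):
    # Stage 2 (toy rule): alternate instructor/learner turns.
    for i, seg in enumerate(segments):
        seg.speaker = "instructor" if i % 2 == 0 else "learner"
    return segments

def identify_language(segments):
    # Stage 3 (placeholder): tag every segment with a language code.
    for seg in segments:
        seg.language = "es"  # stand-in for a real language-ID model
    return segments

def transcribe(segments):
    # Stage 4 (placeholder): run ASR on each segment.
    for seg in segments:
        seg.transcript = "<asr output>"
    return segments

def annotate(audio):
    """Compose the four annotation stages from Section 3.1."""
    return transcribe(identify_language(identify_speakers(diarize(audio))))

annotated = annotate(audio=None)
print([(s.speaker, s.language) for s in annotated])
# → [('instructor', 'es'), ('learner', 'es')]
```

The composition order matters: diarization must come first because every downstream label (speaker, language, transcript) is attached per segment.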

3.2 Model Fine-tuning & Performance

Pre-trained models (e.g., for ASR, diarization) were fine-tuned on the human-annotated MOSLA data. The paper reports substantial performance improvements after fine-tuning, demonstrating the value of domain-specific data even for large pre-trained models. This step was crucial for scaling annotation to the entire 250+ hour corpus.

4. Linguistic & Multimodal Analysis

The annotated dataset enables novel analyses of the SLA process.

4.1 Proficiency Development Metrics

Longitudinal trends were analyzed using metrics such as:

  • Target Language Ratio: The percentage of learner utterances in the target language vs. their native language over time.
  • Lexical Diversity: Measuring vocabulary growth and complexity (e.g., via Type-Token Ratio).
  • Utterance Length & Complexity: Tracking the development of syntactic structures.

These metrics paint a quantitative picture of proficiency development across the two-year journey.
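The first two metrics reduce to simple ratios over annotated learner utterances. A minimal sketch, with invented data (the `"tl"`/`"l1"` tags stand in for the pipeline's language-ID labels):

```python
def target_language_ratio(utterances):
    """Share of learner utterances tagged as target-language."""
    tl = sum(1 for lang, _ in utterances if lang == "tl")
    return tl / len(utterances) if utterances else 0.0

def type_token_ratio(tokens):
    """Lexical diversity: distinct word types over total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Toy week of learner speech: (language_tag, utterance)
week = [("tl", "hola"), ("l1", "what does that mean"), ("tl", "me llamo Ana")]
tokens = [w for lang, utt in week if lang == "tl" for w in utt.split()]

print(round(target_language_ratio(week), 2))  # → 0.67
print(round(type_token_ratio(tokens), 2))     # → 1.0
```

Note that raw type-token ratio is sensitive to sample size, so longitudinal comparisons typically use a length-normalized variant computed over fixed-size windows.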

4.2 Screen Focus Detection

A particularly innovative analysis involved using multimodal deep learning models to predict the learner's area of focus on the shared screen purely from the unannotated video and audio signals. By correlating audio cues (e.g., discussing a specific word) with screen content, the model can infer what the learner is looking at, offering insights into attention and engagement.
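The paper's approach uses deep multimodal models; as a toy illustration of the underlying idea (correlating what was just said with what is on screen), the sketch below picks the screen region whose visible text best overlaps the ASR output. The region names and text are invented.

```python
def bag(text):
    """Lowercased bag-of-words for a text snippet."""
    return set(text.lower().split())

def jaccard(a, b):
    """Word-overlap similarity between two bags of words."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical screen regions and the text visible in each
regions = {
    "slide_text": "el perro grande the big dog vocabulary",
    "whiteboard": "verb conjugation hablar hablo hablas",
    "video": "",
}

def infer_focus(asr_snippet):
    """Guess the screen region the learner is attending to from speech alone."""
    spoken = bag(asr_snippet)
    return max(regions, key=lambda r: jaccard(spoken, bag(regions[r])))

print(infer_focus("como se dice the big dog"))  # → slide_text
```

A learned model replaces the word-overlap score with joint audio-visual embeddings, but the inference target is the same: which region best explains the audio.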

5. Core Insight & Analyst Perspective

Core Insight: Project MOSLA isn't just another dataset; it's a foundational infrastructure play that exposes the critical gap between isolated, snapshot SLA studies and the messy, continuous reality of learning. Its value proposition lies in controlled longitudinality—a feature as rare as it is essential. While projects like the Mozilla Common Voice corpus democratize speech data, they lack the structured learning trajectory and multimodal context that MOSLA provides. Similarly, the BEA-2019 Shared Task focused on error correction in isolated learner writing, missing the rich, interactive dimension captured here.

Logical Flow: The project's logic is elegantly linear: 1) Identify a methodological vacuum (lack of controlled, multimodal, longitudinal SLA data), 2) Engineer a solution (rigorous participant protocol + Zoom recording), 3) Solve the scaling problem (human-in-the-loop ML annotation), and 4) Demonstrate utility (linguistic analysis + novel multimodal tasks). This end-to-end pipeline from data creation to application is a blueprint for empirical learning sciences.

Strengths & Flaws: The strength is undeniable: scale, control, and multimodal richness. It's a researcher's dream for studying temporal dynamics. However, the flaws are in the trade-offs. The "controlled" environment is also its biggest artificiality—real-world language acquisition is gloriously uncontrolled. The sample size, while creating a deep longitudinal dataset, may limit generalizability across diverse learner populations. Furthermore, the technical barrier to utilizing such a complex multimodal dataset remains high, potentially limiting its immediate adoption.

Actionable Insights: For researchers, the immediate action is to explore this open dataset. For EdTech companies, the insight is to move beyond simple completion metrics and model the process of learning as MOSLA does. The screen-focus detection experiment alone suggests a future where learning platforms infer cognitive engagement in real-time. The larger imperative is for the field to shift from cross-sectional "photos" to longitudinal "films" of learning. MOSLA has built the camera; it's now time for the community to start making the movies.

6. Technical Implementation Details

The annotation pipeline relies on several machine learning models. A simplified view of the speaker diarization and identification task can be framed as an optimization problem. Let $X = \{x_1, x_2, ..., x_T\}$ represent the sequence of audio features. The goal is to find the sequence of speaker labels $S = \{s_1, s_2, ..., s_T\}$ and speaker identities $Y = \{y_1, y_2, ..., y_K\}$ that maximize the posterior probability:

$P(S, Y | X) \propto P(X | S, Y) \cdot P(S) \cdot P(Y)$

Where:

  • $P(X | S, Y)$ is the likelihood of the audio features given the speaker segments and identities, often modeled using Gaussian Mixture Models (GMMs) or deep neural network embeddings like x-vectors.
  • $P(S)$ is a prior over speaker turn dynamics, encouraging temporal continuity (e.g., using a hidden Markov model).
  • $P(Y)$ represents the prior knowledge of speaker identities (instructor vs. learner).

Fine-tuning on MOSLA data primarily improves the estimation of $P(X | S, Y)$ by adapting the acoustic model (e.g., the x-vector extractor) to the specific acoustic conditions and speaker characteristics of the online classroom.
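To make the factorization concrete, here is a toy search over speaker-label sequences that scores the unnormalized log posterior. The likelihood values are made up, standing in for x-vector/GMM scores, and a "sticky" turn prior plays the role of the HMM over $P(S)$.

```python
import math

# Toy log-likelihoods log P(x_t | s_t): one dict per audio frame
# (in a real system these come from x-vector or GMM scoring)
loglik = [
    {"instructor": -0.2, "learner": -2.0},
    {"instructor": -0.3, "learner": -1.8},
    {"instructor": -2.5, "learner": -0.1},
]
# Sticky turn prior log P(s_t | s_{t-1}): favors staying with the same speaker
log_stay, log_switch = math.log(0.9), math.log(0.1)

def log_posterior(labels):
    """Unnormalized log P(S | X): likelihood term plus turn-dynamics prior."""
    score = sum(loglik[t][s] for t, s in enumerate(labels))
    score += sum(log_stay if a == b else log_switch
                 for a, b in zip(labels, labels[1:]))
    return score

# Enumerate all labelings with at most one instructor→learner switch
best = max(
    (("instructor",) * i + ("learner",) * (3 - i) for i in range(4)),
    key=log_posterior,
)
print(best)  # → ('instructor', 'instructor', 'learner')
```

The sticky prior is what keeps the decoder from flipping speakers on every noisy frame; real systems use Viterbi decoding rather than enumeration, but the objective is the same.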

7. Experimental Results & Findings

The paper presents key findings from analyzing the MOSLA dataset:

  • Proficiency Trajectories: Graphs show a clear, non-linear increase in the percentage of target language use by learners over time, with plateaus and jumps corresponding to different instructional units. Lexical diversity metrics show a steady upward trend, accelerating after the first six months.
  • Model Performance Gains: Fine-tuning a pre-trained Wav2Vec2.0 model for ASR on just 10 hours of MOSLA human transcripts reduced the Word Error Rate (WER) by over 35% on held-out MOSLA data compared to the base model. Similar significant improvements are reported for speaker and language identification tasks.
  • Screen Focus Detection: A multimodal model (e.g., a vision transformer for screen frames combined with an audio encoder) was trained to classify the broad area of screen focus (e.g., "slide text," "video," "whiteboard"). The model achieved an accuracy significantly above chance, demonstrating that audio-visual correlation contains meaningful signals about learner attention, even without eye-tracking hardware.

Figure 1 (Conceptual): The paper includes a conceptual figure illustrating the MOSLA pipeline: Data Collection (Zoom recordings) -> Data Annotation (Diarization, ID, ASR) -> Multimodal Analysis (Screen focus) & SLA Linguistic Analysis (Proficiency metrics). This figure underscores the project's comprehensive, pipeline-oriented approach.

8. Analysis Framework: Proficiency Trajectory Modeling

Case: Modeling the "Target Language Use" Trajectory

Researchers can use the MOSLA dataset to build growth curve models. A simplified example analyzes the weekly ratio of target language (TL) utterances by a learner. Let $R_t$ be the TL ratio at week $t$.

A basic linear mixed-effects model could be specified as:

R_t ~ 1 + Time_t + (1 + Time_t | Learner_ID)

Where:

  • 1 + Time_t models the fixed effect of an overall intercept and slope (average growth trajectory).
  • (1 + Time_t | Learner_ID) allows both the starting point (intercept) and growth rate (slope) to vary randomly across individual learners.

Using the MOSLA data, one could fit this model (e.g., using R's lme4 or Python's statsmodels) to estimate the average weekly increase in TL use and the degree of individual variability. More complex models could include instructional phase as a predictor or model non-linear growth using polynomial or spline terms for Time. This framework moves beyond comparing pre- and post-tests to modeling the entire learning curve.
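As a lightweight stand-in for a full mixed-effects fit, the sketch below uses a two-stage approximation: fit an ordinary least-squares line per learner, then summarize the slopes across learners. The weekly ratios are invented, and `ols` is a hypothetical helper; a real analysis would use lme4 or statsmodels as noted above.

```python
from statistics import mean, stdev

def ols(time, ratio):
    """Per-learner OLS fit: returns (intercept, slope) of ratio ~ time."""
    tbar, rbar = mean(time), mean(ratio)
    slope = (sum((t - tbar) * (r - rbar) for t, r in zip(time, ratio))
             / sum((t - tbar) ** 2 for t in time))
    return rbar - slope * tbar, slope

# Toy weekly target-language ratios for three learners (weeks 0..4)
weeks = [0, 1, 2, 3, 4]
learners = {
    "A": [0.10, 0.18, 0.24, 0.33, 0.40],
    "B": [0.05, 0.09, 0.15, 0.18, 0.25],
    "C": [0.20, 0.26, 0.35, 0.42, 0.50],
}

fits = {lid: ols(weeks, r) for lid, r in learners.items()}
slopes = [s for _, s in fits.values()]
print(round(mean(slopes), 3))   # fixed-effect analogue: average weekly gain
print(round(stdev(slopes), 3))  # random-slope analogue: between-learner spread
```

The two printed numbers correspond to the fixed slope and the random-slope standard deviation in the mixed model; the two-stage version is only a first approximation because it ignores shrinkage across learners.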

9. Future Applications & Research Directions

The MOSLA dataset opens numerous avenues for future work:

  • Personalized Learning Pathways: Algorithms could analyze a learner's early trajectory in MOSLA to predict future stumbling blocks and recommend personalized review or practice materials.
  • Automated Proficiency Assessment: Developing fine-grained, continuous assessment models that go beyond standardized tests, using multimodal cues (fluency, lexical choice, pronunciation, engagement) as in ETS's research on automated speaking assessment.
  • Teacher Analytics: Analyzing instructor strategies and their correlation with learner progress, providing data-driven feedback for teacher training.
  • Cross-Linguistic Transfer Studies: Comparing acquisition patterns between Arabic, Spanish, and Chinese to understand how language-specific features (e.g., tonal system, script) affect the learning process.
  • Multimodal Foundation Models: MOSLA is an ideal training ground for building multimodal AI models that understand educational dialogue, potentially leading to more sophisticated AI tutors.
  • Expansion: Future iterations could include more languages, larger and more diverse participant pools, biometric data (like heart rate for stress/cognitive load), and integration with learning management system (LMS) data.

10. References

  1. Geertzen, J., Alexopoulou, T., & Korhonen, A. (2014). Automatic Linguistic Annotation of Large Scale L2 Databases: The EF-Cambridge Open Language Database (EFCAMDAT). In Proceedings of the 9th Workshop on Innovative Use of NLP for Building Educational Applications.
  2. Settles, B., LaFlair, G. T., & Hagiwara, M. (2018). Machine Learning-Driven Language Assessment. Transactions of the Association for Computational Linguistics.
  3. Stasaski, K., Devlin, J., & Hearst, M. A. (2020). Measuring and Improving Semantic Diversity of Dialogue Generation. In Findings of the Association for Computational Linguistics: EMNLP 2020.
  4. Hampel, R., & Stickler, U. (2012). The use of videoconferencing to support multimodal interaction in an online language classroom. ReCALL, 24(2), 116-137.
  5. Mozilla Common Voice. (n.d.). Retrieved from https://commonvoice.mozilla.org/
  6. Educational Testing Service (ETS). (2021). Automated Scoring of Speech. Research Report.
  7. Hagiwara, M., & Tanner, J. (2024). Project MOSLA: Recording Every Moment of Second Language Acquisition. arXiv preprint arXiv:2403.17314.