This work investigates the second language (L2) acquisition of neural language models (LMs), shifting focus from the typical study of their first language (L1) acquisition. The core research question is: How does the L1 acquisition of an LM affect the efficiency and nature of its subsequent grammar acquisition in an L2? The study designs a human-like L2 learning scenario for bilingual LMs, pretraining them on an L1 (French, German, Russian, Japanese) before exposing them to English as an L2. The goal is to analyze cross-lingual transfer from a linguistic perspective, using grammatical judgment tests to evaluate syntactic generalization, moving beyond holistic metrics like perplexity.
The experimental pipeline mimics a human L2 learning trajectory with controlled data exposure.
2.1 L1 Pretraining Phase
A masked language model (e.g., based on architectures like BERT) is pretrained from scratch on a monolingual corpus of a chosen L1. This phase establishes the model's initial linguistic "native" competence.
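A minimal sketch of this phase, assuming a Hugging Face transformers/datasets setup; the corpus file, tokenizer name, model size, and training hyperparameters are illustrative placeholders rather than the paper's exact configuration.

```python
# Sketch: L1 pretraining of a masked LM from scratch (illustrative setup, not the paper's exact one).
from datasets import load_dataset
from transformers import (BertConfig, BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Hypothetical monolingual L1 corpus (e.g., French), one sentence per line.
l1_corpus = load_dataset("text", data_files={"train": "l1_french.txt"})["train"]

# A tokenizer trained on the L1 corpus is assumed to exist already.
tokenizer = BertTokenizerFast.from_pretrained("tokenizer_l1_french")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

l1_tokens = l1_corpus.map(tokenize, batched=True, remove_columns=["text"])

# Small BERT-style model initialized from scratch: this is where "native" L1 competence is learned.
config = BertConfig(vocab_size=tokenizer.vocab_size, hidden_size=256,
                    num_hidden_layers=8, num_attention_heads=8)
model = BertForMaskedLM(config)

# Standard MLM objective: 15% of tokens are masked and predicted from bidirectional context.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lm_l1_french", num_train_epochs=20,
                           per_device_train_batch_size=64),
    train_dataset=l1_tokens,
    data_collator=collator,
)
trainer.train()
trainer.save_model("lm_l1_french")
```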
The L1-pretrained model is then further trained (fine-tuned) on a limited English (L2) corpus. The study explores different data conditions, either L2 monolingual texts only or a mix that includes L1-L2 parallel translation pairs, with the training data size restricted to simulate realistic human L2 input.
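Continuing the sketch above, the L2 phase reuses the same MLM objective but starts from the saved L1 checkpoint; the checkpoint names, shared tokenizer, and epoch count are assumptions for illustration, not the paper's settings.

```python
# Sketch: L2 (English) exposure, continuing MLM training from the L1 checkpoint (theta_L1).
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model = BertForMaskedLM.from_pretrained("lm_l1_french")                 # start from theta_L1
tokenizer = BertTokenizerFast.from_pretrained("tokenizer_l1_french")    # shared vocabulary assumed

# Small English (L2) corpus, size-restricted to mimic realistic human L2 input.
l2_corpus = load_dataset("text", data_files={"train": "l2_english_small.txt"})["train"]
l2_tokens = l2_corpus.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
                          batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lm_l1_french_l2_english", num_train_epochs=100,
                           per_device_train_batch_size=64),
    train_dataset=l2_tokens,
    data_collator=collator,
)
trainer.train()                       # the resulting parameters correspond to theta_L1+L2
trainer.save_model("lm_l1_french_l2_english")
```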
The model's L2 linguistic knowledge is probed using the BLiMP benchmark (The Benchmark of Linguistic Minimal Pairs). BLiMP tests specific grammatical phenomena (e.g., subject-verb agreement, filler-gap dependencies) by having the model choose between a grammatical and an ungrammatical sentence pair, providing a fine-grained analysis of syntactic generalization.
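As a concrete illustration, the sketch below runs one BLiMP sub-task using the Hugging Face blimp dataset (assuming its sentence_good / sentence_bad fields); score_sentence stands in for any sentence scorer, and one pseudo-log-likelihood implementation is sketched after the scoring equations later in this section.

```python
# Sketch: grammatical judgment accuracy on one BLiMP sub-task.
from datasets import load_dataset

# One of BLiMP's 67 paradigms; config names follow the Hugging Face "blimp" dataset.
pairs = load_dataset("blimp", "regular_plural_subject_verb_agreement_1")["train"]

def blimp_accuracy(score_sentence):
    """score_sentence(str) -> float; higher means the LM finds the sentence more probable."""
    correct = 0
    for example in pairs:
        if score_sentence(example["sentence_good"]) > score_sentence(example["sentence_bad"]):
            correct += 1
    return correct / len(pairs)
```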
Initial experiments compared how different L2 training data configurations affect acquisition speed and quality.
Training solely on L2 monolingual texts led to faster L2 grammar acquisition than more complex data configurations.
Interestingly, feeding L1-L2 translation pairs to the LM during L2 training slowed down the acquisition of L2 grammatical knowledge. This suggests that explicit parallel alignment might introduce noise or a conflicting learning signal for pure syntactic generalization in the early stages of L2 learning for LMs.
The core findings reveal significant effects of L1 on L2 acquisition in LMs.
Models with L1 pretraining achieved better performance on the English BLiMP benchmark after L2 exposure compared to models trained on English from scratch with equivalent data. This indicates that prior linguistic knowledge, even from a different language, provides a useful inductive bias for learning new grammatical structures.
The transfer efficacy varied by L1. Models with French or German as L1 showed stronger L2 (English) generalization than those with Russian or Japanese as L1. This aligns with human language learning difficulty rankings (e.g., Chiswick & Miller, 2004), where linguistic proximity (e.g., shared Germanic roots for English/German) facilitates transfer.
The boost from L1 pretraining was most pronounced for morphological (e.g., verb conjugation) and syntactic (e.g., word order) items. Gains were smaller for purely semantic items or those requiring integration of syntax and semantics. This suggests L1 knowledge primarily aids in acquiring formal structural rules of the L2.
The acquisition of L2 knowledge was found to be data-inefficient. Performance improved significantly only after the model had been exposed to the entire limited L2 dataset many times (e.g., 50-100 epochs), unlike humans who can generalize from fewer examples.
During L2 training, the model's performance on its original L1 tasks degraded. This phenomenon, analogous to "catastrophic forgetting" in continual learning, highlights a key difference from balanced human bilingualism and points to the need for techniques to maintain linguistic knowledge balance.
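The paper does not prescribe a remedy, but a common continual-learning baseline is to penalize drift away from the L1 parameters during L2 training (an L2-SP-style regularizer, a simplified relative of elastic weight consolidation). The sketch below illustrates that idea under that assumption; it is not a technique from the paper.

```python
# Sketch: penalize parameter drift away from theta_L1 during L2 training
# (a simple continual-learning baseline, not a technique from the paper).

def make_drift_penalty(model, strength=0.01):
    # Snapshot the L1-pretrained weights as a fixed anchor.
    anchor = {name: p.detach().clone() for name, p in model.named_parameters()}
    def penalty():
        return strength * sum(((p - anchor[name]) ** 2).sum()
                              for name, p in model.named_parameters())
    return penalty

# In a manual L2 training loop, add the penalty to the MLM loss:
#   loss = mlm_loss + drift_penalty()
#   loss.backward(); optimizer.step()
```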
The core of the LM is based on the Transformer architecture and the masked language modeling (MLM) objective. During L1 pretraining, the model learns by predicting randomly masked tokens $w_t$ in a sequence $\mathbf{x} = (w_1, ..., w_T)$ based on their context. The objective is to maximize the log-likelihood:
$$\mathcal{L}_{MLM} = \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \sum_{t \in M} \log P(w_t | \mathbf{x}_{\backslash t}; \theta)$$
where $M$ is the set of masked positions, $\mathcal{D}$ is the L1 corpus, and $\theta$ are the model parameters. During L2 acquisition, the same objective is applied to the L2 corpus $\mathcal{D}_{L2}$, starting from the L1-pretrained parameters $\theta_{L1}$ and fine-tuning them to $\theta_{L1+L2}$. Grammatical judgment on BLiMP then compares the model's probability scores for a minimal pair $(s_{grammatical}, s_{ungrammatical})$:
$$P(s_{grammatical}) > P(s_{ungrammatical})$$
where $P(s) = \prod_{t=1}^{T} P(w_t \mid \mathbf{s}_{\backslash t}; \theta)$ is the pseudo-likelihood of sentence $s$, obtained by masking each token in turn and scoring it against its bidirectional context.
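A sketch of this scoring rule for a masked LM, implementing the pseudo-likelihood above by masking one position at a time; the model and tokenizer names continue the earlier placeholder checkpoints.

```python
# Sketch: pseudo-log-likelihood scoring of a sentence under the bilingual masked LM.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("tokenizer_l1_french")          # placeholder names
model = BertForMaskedLM.from_pretrained("lm_l1_french_l2_english").eval()

@torch.no_grad()
def pll_score(sentence):
    """Sum of log P(w_t | sentence with position t masked), over all real tokens."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for t in range(1, len(input_ids) - 1):                # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[t] = tokenizer.mask_token_id
        logits = model(masked.unsqueeze(0)).logits[0, t]
        total += torch.log_softmax(logits, dim=-1)[input_ids[t]].item()
    return total

# Grammatical judgment on a BLiMP-style minimal pair:
good = pll_score("The key to the cabinets is on the table.")
bad = pll_score("The key to the cabinets are on the table.")
print(good > bad)   # True counts as a correct judgment
```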
Figure 1 (Experimental Procedure Diagram): The diagram visually outlines the three-stage pipeline. From left to right: 1) Multiple boxes labeled "LM in Fr," "LM in Ge," etc., representing different L1 models after pretraining. 2) An arrow labeled "Exposure to L2 (English)" points from these models to a central box containing the text "Corpus" and the BLiMP benchmark icon. 3) Another arrow labeled "Test L2 knowledge" points from the central box to a final box showing the evaluation outcome "Aa" (likely representing accuracy scores). The diagram effectively communicates the comparative setup where models with different L1 bases are subjected to the same L2 learning and evaluation regimen.
Key Result Visualization (Implied): While not explicitly graphed in the provided text, the results would typically be presented in bar charts or line graphs showing: 1) BLiMP accuracy scores for English (L2) on the y-axis, grouped by the L1 of the model (French, German, Russian, Japanese) on the x-axis, clearly showing the French/German advantage. 2) A line graph showing L2 accuracy (y-axis) over training epochs/iterations (x-axis) for different L1 models, demonstrating the slow, data-inefficient learning curve. 3) A grouped bar chart showing accuracy gains from L1 pretraining for different BLiMP sub-categories (Morphology, Syntax, Semantics, etc.), highlighting the larger gains for formal syntactic phenomena.
Case Study: Analyzing L1-L2 Transfer for Subject-Verb Agreement
1. Phenomenon: English requires verb inflection to agree with the number of the subject (e.g., "The dog runs" vs. "The dogs run").
2. L1 Influence Hypothesis: An LM pretrained on French (which has rich subject-verb agreement) may have a stronger latent representation for the concept of "agreement" between sentence elements compared to an LM pretrained on Japanese (which lacks verb conjugation for number). This abstract structural bias could facilitate learning the specific realization of this rule in English.
3. Testing with BLiMP: The model is presented with minimal pairs like:
Grammatical: The key to the cabinets *is* on the table.
Ungrammatical: The key to the cabinets *are* on the table.
The model must assign a higher probability to the grammatical sentence.
4. Expected Result: The French-L1 model is predicted to achieve higher accuracy on this BLiMP subset earlier in L2 training than the Japanese-L1 model, demonstrating positive transfer of an abstract grammatical concept.
5. Framework Application: This case can be formalized by probing the model's internal representations (e.g., using diagnostic classifiers) after L1 training to see if a "number agreement" detector can be trained more easily from the French-L1 model's embeddings. Then, tracking the performance curve on English agreement during L2 training quantifies the transfer benefit.
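A sketch of the probing idea in step 5, assuming frozen encoder representations, a toy labeled set of singular/plural-subject sentences, and a simple logistic-regression probe; the checkpoint names and mean-pooling choice are illustrative, and this is one way to instantiate the diagnostic-classifier idea rather than the paper's protocol.

```python
# Sketch: diagnostic classifier probing for a "number agreement" feature in frozen embeddings.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("tokenizer_l1_french")   # placeholder names
encoder = BertModel.from_pretrained("lm_l1_french").eval()             # L1-only checkpoint

# Toy probe data: sentences labeled by subject number (0 = singular, 1 = plural).
sentences = ["The key is on the table.", "The keys are on the table."]
labels = [0, 1]

@torch.no_grad()
def sentence_embedding(sentence):
    enc = tokenizer(sentence, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state        # (1, seq_len, hidden_size)
    return hidden.mean(dim=1).squeeze(0).numpy()     # mean-pooled sentence vector

X = [sentence_embedding(s) for s in sentences]
probe = LogisticRegression(max_iter=1000).fit(X, labels)

# With a real, held-out probe set, higher probe accuracy from the French-L1 encoder than from
# the Japanese-L1 encoder would suggest a more linearly accessible agreement feature.
```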
Core Insight
This paper isn't just another incremental NLP study; it's a bold, necessary pivot from treating LMs as monolithic "language" processors to viewing them as simulated cognitive systems with a developmental trajectory. The core insight is that an LM's "native language" fundamentally sculpts its learning biases, making cross-lingual transfer not a free bonus but a structured, predictable, and uneven process. The finding that parallel data can hinder syntactic acquisition is a bombshell for standard multilingual training dogma, suggesting that early-stage L2 learning in machines, like in humans, might benefit more from immersive, monolingual exposure than from explicit translation exercises.
Logical Flow
The authors' logic is admirably clean: 1) Isolate the variable (L1 identity) while controlling for architecture and L2 data. 2) Use a linguistically-grounded evaluation (BLiMP) instead of task-specific fine-tuning, which often conflates linguistic knowledge with task-specific heuristics. 3) Compare to human benchmarks (language difficulty rankings), providing a crucial external validation point often missing in pure ML research. This methodological rigor allows them to move from correlation (L1 affects L2 performance) towards a mechanistic hypothesis (abstract structural knowledge transfers).
Strengths & Flaws
Strengths: The study's primary strength is its interdisciplinary bridge-building. By framing the problem in terms of SLA theory, it generates hypotheses that are novel to NLP (e.g., testing differential transfer across grammatical phenomena). The controlled, human-scale data setting is a refreshing counterpoint to the "more data is always better" paradigm, forcing the models to generalize, not memorize.
Critical Flaws: The elephant in the room is scale. The experiments are conducted with relatively small LMs. As highlighted by the "Scaling Laws" research from OpenAI and others, model behavior can change dramatically with size. Does the French-L1 advantage hold for a 500B parameter model, or does sheer capacity overwhelm inductive bias? Furthermore, the focus on syntax via BLiMP, while precise, ignores the vast terrain of semantic and pragmatic transfer, which are equally critical for fluency. The observed catastrophic forgetting of L1 also points to a fundamental architectural limitation compared to the neuroplasticity of the human brain.
Actionable Insights
For practitioners, this research offers a blueprint for strategic pretraining. Don't just pretrain on a random soup of languages. If the target is high-performance in language X, first pretrain on its closest linguistic relatives to bootstrap structural learning. For researchers, the agenda is clear: 1) Scale up the experiments to modern LLM sizes to test the robustness of these findings. 2) Integrate continual learning techniques from the start to combat L1 degradation—this is no longer a niche problem but central to building stable multilingual agents. 3) Develop more comprehensive linguistic benchmarks that go beyond minimal pairs to include discourse coherence and pragmatic appropriateness, perhaps drawing from frameworks like the Common European Framework of Reference for Languages (CEFR). Ultimately, this work shifts the goal from building models that know languages to building models that learn them in a human-like way—a far more ambitious and intellectually rich pursuit.