Nepali Grammar Correction (GEC) — Exact Architecture & Results

This page mirrors the paper & thesis. All figures, counts and flows match their sources.

Overview

The Nepali GEC system is a two-stage pipeline: (1) a BERT-based Grammatical Error Detection (GED) model labels a sentence as correct or incorrect; (2) for incorrect inputs, a Masked Language Model (MLM) (MuRIL or NepBERTa) generates alternative sentences, which are filtered again by the GED to surface only valid suggestions. This design exactly follows the process shown in the paper’s Fig. 1–2 and the thesis Figures 3.1–3.2.

Parallel pairs: 8,130,496
Error types: 7
Best GED accuracy (MuRIL): 91.15%
Classifier batch size: 256

Block Diagram — GED (Sequence Classification)

GED flow
Mirrors paper Fig. 1 and thesis Figure 3.1 exactly; the classifier is BERT (MuRIL/NepBERTa) + a dense layer predicting correct vs incorrect.
Source: Figures described in the paper and thesis.
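
As a concrete illustration of this classifier stage, the sketch below loads a fine-tuned binary sequence-classification model with Hugging Face transformers and labels a single sentence. The checkpoint path "ged-muril" and the label convention (1 = correct) are assumptions made for illustration; neither is specified by the paper.

# Minimal GED inference sketch. Assumptions: a fine-tuned binary checkpoint is available
# at "ged-muril" (hypothetical path) and label 1 means "correct".
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ged_tok = AutoTokenizer.from_pretrained("ged-muril")
ged_model = AutoModelForSequenceClassification.from_pretrained("ged-muril", num_labels=2).eval()

def is_correct(sentence: str) -> bool:
    """Label a sentence as grammatically correct (True) or incorrect (False)."""
    enc = ged_tok(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = ged_model(**enc).logits
    return int(logits.argmax(dim=-1)) == 1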

Block Diagram — GEC Engine (MLM + GED filter)

GEC flow
Matches paper Fig. 2 and thesis Figure 3.2, including both masking strategies and the GED-based filtering.
Source: Figures described in the paper and thesis.
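
The sketch below illustrates the correction engine under stated assumptions: whitespace tokenization for mask placement, two illustrative masking strategies (replacing one word with the mask token, and inserting a mask token into one gap), top-k MLM fills, and the public MuRIL checkpoint ID. The exact strategies and checkpoints are those of the paper's Fig. 2 and may differ; the is_correct filter is the GED sketch above.

# GEC candidate-generation sketch. Assumptions: whitespace tokenization, two illustrative
# masking strategies, top-k MLM fills, and the public MuRIL checkpoint; the paper's exact
# strategies (Fig. 2) and fine-tuned checkpoints are not reproduced here.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

mlm_tok = AutoTokenizer.from_pretrained("google/muril-base-cased")
mlm = AutoModelForMaskedLM.from_pretrained("google/muril-base-cased").eval()

def masked_variants(sentence: str):
    words = sentence.split()
    # Strategy A: replace one word with [MASK] (substitution-type errors).
    for i in range(len(words)):
        yield " ".join(words[:i] + [mlm_tok.mask_token] + words[i + 1:])
    # Strategy B: insert [MASK] into one gap (missing-word errors, e.g. a missing auxiliary verb).
    for i in range(len(words) + 1):
        yield " ".join(words[:i] + [mlm_tok.mask_token] + words[i:])

def suggest(sentence: str, top_k: int = 3) -> list[str]:
    """Fill every masked variant with the MLM and keep only candidates the GED accepts."""
    kept = []
    for variant in masked_variants(sentence):
        enc = mlm_tok(variant, return_tensors="pt")
        mask_pos = (enc.input_ids[0] == mlm_tok.mask_token_id).nonzero(as_tuple=True)[0]
        with torch.no_grad():
            logits = mlm(**enc).logits
        for token_id in logits[0, mask_pos[0]].topk(top_k).indices.tolist():
            filled = variant.replace(mlm_tok.mask_token, mlm_tok.decode([token_id]).strip(), 1)
            if filled != sentence and is_correct(filled):  # GED filter from the sketch above
                kept.append(filled)
    return kept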

Corpus — Exact Statistics

The parallel corpus contains 8,130,496 source–target pairs across seven error types. Distribution:

Error Type                   Instances    Percent
Verb inflection              3,202,676     39.39%
Pronouns (subject missing)     316,393      3.89%
Sentence structure           1,001,038     12.31%
Auxiliary verb missing       1,031,388     12.69%
Main verb missing            1,031,274     12.68%
Punctuation errors           1,044,203     12.84%
Homophones                     503,524      6.20%
Total                        8,130,496       100%

Counts and percentages reproduced verbatim from the paper’s Table VIII.

Dataset splits used for GED fine-tuning

Split         Correct sentences   Incorrect sentences   Total sentences
Train                 2,568,682             7,514,122        10,082,804
Validation              365,606               405,905           771,511

Training Setup (GED)

  • MuRIL: 1 epoch; AdamW (lr=5e-5, β1=0.9, β2=0.999, ε=1e-8); batch size 256; cross-entropy loss (a minimal fine-tuning sketch follows this list).
  • NepBERTa: 2 epochs; same optimizer and batch size.
  • Trainable parameters — MuRIL: 237,557,762; NepBERTa: 109,514,298.
  • Tokenizer vocabulary size — MuRIL: 197,285; NepBERTa: 30,523.
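
The hyperparameters above map onto a standard Hugging Face Trainer run roughly as follows. Only the optimizer settings, batch size, epoch counts, and cross-entropy objective come from the reported setup; the two-sentence placeholder dataset, the public MuRIL checkpoint ID, and the output path are illustrative assumptions.

# GED fine-tuning sketch mirroring the reported setup (AdamW lr=5e-5, betas 0.9/0.999,
# eps=1e-8, batch size 256, cross-entropy; 1 epoch for MuRIL, 2 for NepBERTa).
# The tiny dataset and the checkpoint/output paths are placeholders, not from the paper.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

name = "google/muril-base-cased"        # or "NepBERTa/NepBERTa" for the second run
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# Placeholder data: label 1 = correct sentence, label 0 = incorrect sentence.
raw = Dataset.from_dict({"text": ["म घर जान्छु ।", "म घर जान्छ ।"], "label": [1, 0]})
data = raw.map(lambda b: tok(b["text"], truncation=True, padding="max_length", max_length=64),
               batched=True)

args = TrainingArguments(
    output_dir="ged-muril",             # hypothetical output path
    num_train_epochs=1,                 # MuRIL: 1 epoch; NepBERTa: 2 epochs
    per_device_train_batch_size=256,
    learning_rate=5e-5,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)

# Cross-entropy is the default loss when integer labels are passed to a sequence classifier.
Trainer(model=model, args=args, train_dataset=data, eval_dataset=data).train()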

Results (GED)

Model       Training Loss   Validation Loss   Accuracy
MuRIL              0.2427            0.2177     91.15%
NepBERTa           0.2776            0.3446     81.73%

Processing time for MLM suggestions grows roughly linearly with sentence length (see paper Fig. 3 and Fig. 4).

Implementation Notes

Original PNG Diagrams (for reference)

GED flow (PNG)
GED — provided PNG (kept for parity with the thesis/paper figures).
GEC flow (PNG)
GEC — provided PNG (kept for parity with the thesis/paper figures).

Models & Datasets (Hugging Face)

Resources

PDFs are bundled alongside this page for offline reading.

Citation

When referencing this work, cite the peer‑reviewed INSPECT 2024 paper (BibTeX below).

@inproceedings{aryal2024bert,
  title={BERT-Based Nepali Grammatical Error Detection and Correction Leveraging a New Corpus},
  author={Aryal, Sumit and Jaiswal, Anku},
  booktitle={2024 IEEE International Conference on Intelligent Signal Processing and Effective Communication Technologies (INSPECT)},
  pages={1--6},
  year={2024},
  organization={IEEE}
}

Provenance & Alignment

The block diagrams above reproduce the flows from the paper's Fig. 1–2 (GED and end-to-end GEC) and the thesis Figures 3.1–3.2. The corpus statistics table matches the paper's Table VIII, the dataset splits match Table IX, and the GED results match Table XI.