This paper reproduces examples of harmful content — including hate speech, racism, sexism, and harassment — for research and educational purposes. Reader discretion is advised.
Introduction
Content moderation at scale is an energy and accuracy problem. Large language models have attracted attention as a moderation tool, but their inference costs are severe. A platform the size of Reddit, which recorded 5 billion user-generated posts in the first half of 2025, would consume roughly 5 GWh on LLM-based moderation alone. That is enough electricity to power approximately 476 homes for a year.
Small, task-specific transformer models offer a practical alternative. They run fast, consume a fraction of the compute, and can be fine-tuned directly against a platform's own harm definitions. The open question is how to train them well, particularly on the kind of context-dependent language where intent shifts depending on the target of a statement.
We investigate curriculum learning as one answer to that question. The idea: train models first on synthetic examples with clear, unambiguous harm patterns, then gradually introduce real-world data. We compare five encoder-only transformer architectures trained with and without curriculum learning against a prompt-engineered Mistral-7B, measuring accuracy, precision, recall, and energy consumption across binary (safe/unsafe) and five-way harm classification tasks.
Dataset Construction
Three public datasets formed the base: the Cyberbullying dataset (115,661 samples, 6 classes), Hate Speech Offensive Tweets (24,000 samples, 3 classes), and a Reviews dataset (49,581 samples, 2 classes). The reviews corpus was included specifically to correct for lexical bias in the other two.
Lexical analysis of the cyberbullying and hate speech datasets revealed that words like "hate" and "stupid" appeared in toxic contexts 100% of the time. A model trained on that distribution will flag "I hate broccoli" as harmful. To correct this, 1,000 movie review samples were added, providing negative language in clearly safe contexts.
All samples were relabelled to fit a six-category governance model: Safe, Toxic, Hate, Racist, Sexist, and Sexual. High-confidence examples were tagged via rule-based pattern matching; ambiguous ones went through local Mistral-7B inference. The resulting dataset contains approximately 16,600 curated, class-balanced examples.
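A minimal sketch of that two-stage labelling flow, assuming illustrative patterns and a generic `llm_classify` callable standing in for the local Mistral-7B call (neither the full rule set nor the prompt is reproduced here):

```python
import re

# Illustrative high-confidence patterns; the real rule set is larger and category-specific.
RULES = {
    "Toxic":  [r"\byou('re| are) (an? )?(idiot|moron)\b"],
    "Sexist": [r"\bwomen (are stupid|belong in the kitchen)\b"],
    "Safe":   [r"\bi hate (broccoli|this movie)\b"],
}

def rule_label(text: str):
    """Return a category when a high-confidence pattern matches, otherwise None."""
    lowered = text.lower()
    for category, patterns in RULES.items():
        if any(re.search(p, lowered) for p in patterns):
            return category
    return None

def label_sample(text: str, llm_classify):
    """Tag high-confidence samples by rule; route ambiguous ones to local LLM inference."""
    label = rule_label(text)
    return label if label is not None else llm_classify(text)
```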
To validate the LLM labeller, the same prompt was applied to a held-out benchmark set and compared against a human moderator. Agreement was above 95% for binary classification. For the five-way problem it fell to around 77%, reflecting genuine label ambiguity at category boundaries rather than outright error.
Harm Category Definitions
| Category | Definition | Example |
|---|---|---|
| Safe | Neutral or non-harmful content, including negative opinions about objects or ideas that do not target individuals or groups. | "I hate broccoli" / "This movie was terrible" |
| Toxic | Hostile language attacking individuals. Personal insults, aggressive profanity, attacks on competence. | "You're an idiot" / "Shut up, moron" |
| Hate | Attacks on groups based on religion, nationality, ethnicity, disability, age, or caste (excludes race, gender, sexuality). | "Muslims are terrorists" / "Refugees should go back" |
| Racist | Attacks based on race, skin color, or racial identity. Slurs, claims of racial superiority, promotion of segregation. | "White supremacy is correct" / racial slurs |
| Sexist | Attacks based on gender or gender identity. Misogyny, misandry, transphobia, objectification. | "Women belong in the kitchen" / "Trans women aren't real women" |
| Sexual | Unwanted sexual content, harassment, or solicitation, including any content sexualizing minors. | "Send me nudes" (unsolicited) / explicit coercion |
Curriculum Learning Dataset
A separate synthetic dataset was built to teach contextual patterns explicitly. The vocabulary covers 50 negative words, 40 positive words, 20 profanity variants, 150 safe objects (safe to criticize), 45 protected identity groups, and 20 person pronouns. Five sentence patterns were generated per category, 1,000 samples each, designed so the same offensive language flips classification based solely on its target.
| Pattern | Safe | Unsafe |
|---|---|---|
| Hate expression | "I hate broccoli" | "I hate women" |
| Stupidity claims | "That's a stupid movie" | "Women are stupid" |
| Profanity usage | "This fucking movie sucks" | "Fuck you" |
| Ideological critique | "I disagree with Islam" | "Muslims are terrorists" |
| General expression | "I love this" | "Kill yourself" |
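Generating these pairs reduces to filling templates from the controlled vocabulary; a minimal sketch of the hate-expression pattern, assuming short illustrative word lists in place of the full vocabularies described above:

```python
import random

# Illustrative stand-ins for the full controlled vocabularies described above.
NEGATIVE_WORDS = ["hate", "despise", "resent"]
SAFE_OBJECTS = ["broccoli", "this movie", "Mondays"]
PROTECTED_GROUPS = ["women", "Muslims", "refugees"]

def hate_expression_pair():
    """One (safe, unsafe) pair: identical wording, only the target changes the label."""
    verb = random.choice(NEGATIVE_WORDS)
    safe = (f"I {verb} {random.choice(SAFE_OBJECTS)}", "Safe")
    unsafe = (f"I {verb} {random.choice(PROTECTED_GROUPS)}", "Unsafe")
    return safe, unsafe

def generate(n_per_pattern: int = 1000):
    """Generate samples for one pattern; the other four patterns follow the same recipe."""
    samples = []
    for _ in range(n_per_pattern):
        samples.extend(hate_expression_pair())
    return samples
```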
Model Architectures
Five encoder-only transformer models were evaluated. All share the bidirectional self-attention mechanism from the original BERT architecture but differ in pre-training strategy, parameter efficiency, and size.
- RoBERTa (124.6M params): Identical architecture to BERT, retrained with dynamic masked language modelling, larger batches, and more data. No next-sentence prediction objective. Produces strong general-purpose contextual embeddings.
- ALBERT (11.7M params): Parameter-efficient BERT variant using cross-layer weight sharing and factorised embeddings. Smallest model in the set. Substitutes sentence-order prediction for next-sentence prediction.
- DistilBERT (67M params): A 6-layer distilled BERT trained to mimic the outputs and internal representations of BERT-base. Reduced size and faster inference with minimal accuracy loss.
- DeBERTa (139.2M params): Disentangles content and position into separate attention matrices, then reintegrates absolute position at the decoding layer. Strongest contextual representation in this set.
- ELECTRA (109.5M params): Pre-trains a discriminator to detect replaced tokens across all positions rather than predict masked tokens in sparse positions. More sample-efficient than MLM-based approaches.
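All five encoders expose the same sequence-classification interface in the Hugging Face transformers library. A loading sketch follows; the specific checkpoint names are assumptions (standard base checkpoints whose sizes match the parameter counts above):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed base checkpoints; parameter counts correspond to the figures above.
CHECKPOINTS = {
    "RoBERTa":    "roberta-base",
    "ALBERT":     "albert-base-v2",
    "DistilBERT": "distilbert-base-uncased",
    "DeBERTa":    "microsoft/deberta-base",
    "ELECTRA":    "google/electra-base-discriminator",
}

def load(name: str, num_labels: int):
    """Load an encoder with a fresh classification head (2 labels binary, 5 multi-harm)."""
    ckpt = CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=num_labels)
    return tokenizer, model
```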
Fine-Tuning Setup
Each model was fine-tuned on two tasks. The binary task maps all five harm categories to label 1 and safe content to label 0 (16,599 training examples, 25%/75% safe-to-unsafe split). The multi-class task trains models to discriminate between the five harm categories on unsafe content only (13,196 examples).
Curriculum training ran in three stages. First, five epochs on the full training set with weighted cross-entropy loss. Second, three epochs on each of five progressively more real data mixtures: 100% synthetic, 75% synthetic / 25% real, 50% / 50%, 25% synthetic / 75% real, and 100% real. Third, five additional epochs on purely real data. Weighted loss was applied throughout to handle class imbalance.
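A sketch of the staged schedule with class-weighted loss; the `mix` sampler, the `train_epochs` helper, and the dataset objects are hypothetical stand-ins for the actual training loop:

```python
import torch
from torch import nn

def weighted_ce(labels: torch.Tensor) -> nn.CrossEntropyLoss:
    """Weighted cross-entropy: inverse-frequency class weights to offset imbalance."""
    counts = torch.bincount(labels)
    weights = counts.sum() / (len(counts) * counts.float())
    return nn.CrossEntropyLoss(weight=weights)

def curriculum_train(model, full_set, synthetic, real, mix, train_epochs):
    """Three-stage schedule; `mix` and `train_epochs` are hypothetical helpers."""
    # Stage 1: five epochs on the full training set with weighted loss.
    train_epochs(model, full_set, epochs=5, loss_fn=weighted_ce(full_set.labels))

    # Stage 2: three epochs per mixture, moving from all-synthetic to all-real data.
    for frac in (1.00, 0.75, 0.50, 0.25, 0.00):
        stage = mix(synthetic, real, synthetic_fraction=frac)
        train_epochs(model, stage, epochs=3, loss_fn=weighted_ce(stage.labels))

    # Stage 3: five additional epochs on purely real data.
    train_epochs(model, real, epochs=5, loss_fn=weighted_ce(real.labels))
```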
Benchmark Results
The benchmark used 1,000 samples held out from training and relabelled by a human moderator. An additional 97-sample contextual test set checked each model's ability to distinguish identical harmful language directed at people versus objects. All models were tested on the same data.
Performance: Precision and Recall
All five small models outperformed Mistral-7B on both precision and recall in binary and multi-class tasks. For the multi-harm problem, Mistral achieved precision of 0.7338 and recall of 0.7084; most fine-tuned models beat both figures while using a tiny fraction of the compute.
Recall is the priority metric here. A moderation system that misclassifies harmful content as safe is more dangerous than one that over-flags borderline text. On recall, RoBERTa led in both tasks. On precision, DeBERTa was best for binary classification and ALBERT for multi-harm. Curriculum-trained models consistently scored below their baseline counterparts on the real-world benchmark, a pattern discussed further below.
Energy Efficiency
Energy per 1,000 samples was calculated by multiplying classification time by GPU power draw, measured on an AMD RX6800 running Arch Linux.
| Model | Parameters | Task | Latency (s) | Energy (J) |
|---|---|---|---|---|
| RoBERTa | 124.6M | Binary | 8.14 | 1,220.94 |
| RoBERTa | 124.6M | Multi-Harm | 8.46 | 1,268.35 |
| ALBERT | 11.7M | Binary | 7.97 | 1,194.81 |
| ALBERT | 11.7M | Multi-Harm | 7.97 | 1,194.99 |
| DistilBERT | 67.0M | Binary | 4.27 | 640.77 |
| DistilBERT | 67.0M | Multi-Harm | 4.31 | 647.06 |
| ELECTRA | 109.5M | Binary | 7.71 | 1,156.64 |
| ELECTRA | 109.5M | Multi-Harm | 7.67 | 1,151.13 |
| DeBERTa | 139.2M | Binary | 14.05 | 2,106.99 |
| DeBERTa | 139.2M | Multi-Harm | 13.90 | 2,084.33 |
| Mistral-7B | 7,000M | Binary | 781.99 | 156,397.66 |
| Mistral-7B | 7,000M | Multi-Harm | 716.38 | 143,276.26 |
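Energy here is simply average power draw multiplied by classification time. The tabulated figures imply a draw of roughly 150 W for the encoder models and about 200 W for Mistral-7B (for example, 640.77 J / 4.27 s ≈ 150 W); a minimal sketch of the calculation:

```python
def energy_per_1k(latency_s: float, power_w: float) -> float:
    """Energy (joules) for 1,000 samples: average power draw (W) x classification time (s)."""
    return power_w * latency_s

# Draws derived from the table above (J / s), applied to the reported latencies:
print(energy_per_1k(4.27, 150.0))    # DistilBERT, binary:  ~640.5 J
print(energy_per_1k(781.99, 200.0))  # Mistral-7B, binary:  ~156,398 J
```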
DistilBERT is the most energy-efficient architecture at 640-647 joules per 1,000 samples, despite having nearly six times as many parameters as ALBERT. That counterintuitive result points to the distillation process: the distilled 6-layer network performs roughly half the layer computations of a 12-layer model, a throughput gain that raw parameter compression does not deliver. ALBERT at 11.7M parameters still consumes roughly 1,195 joules because cross-layer weight sharing shrinks the parameter count without reducing the number of layer computations executed at inference.
Mistral-7B consumed 143,000-156,000 joules per 1,000 samples and required 716-782 seconds. That is 68-244 times more energy and 51-167 times slower than the small models. The root cause is architectural: encoder-only models need one forward pass to produce a classification. Mistral, as a decoder-only model, generates text autoregressively, token by token, through multiple forward passes. More compute, worse results.
Category-Level Performance
Baseline models achieved F1 scores between 0.7350 and 0.7524 across the five harm categories. Curriculum-trained models scored 0.6481 to 0.7346. The gap is meaningful and, for four of five architectures, statistically significant.
Performance varied sharply by category. Hate speech was easiest to detect, with F1 scores of 0.81-0.84 for baseline models. Racist and sexist content followed at 0.76-0.78. Toxic and sexual content proved harder: F1 scores of 0.63-0.71, with recall on toxic content falling to 0.50-0.55. Many toxic messages appear to be classified into the more specific categories (hate, racist, sexist) rather than correctly identified as general hostility.
| Architecture | Curriculum F1 | Baseline F1 | Delta F1 (%) | McNemar p-value | Significant |
|---|---|---|---|---|---|
| RoBERTa | 0.7346 | 0.7524 | -2.37 | 0.093 | No |
| ALBERT | 0.7221 | 0.7469 | -2.98 | 0.044 | Yes |
| DistilBERT | 0.7003 | 0.7350 | -4.67 | 0.011 | Yes |
| ELECTRA | 0.6481 | 0.7354 | -12.18 | <0.001 | Yes |
| DeBERTa | 0.6634 | 0.7354 | -8.90 | <0.001 | Yes |
McNemar's test was used throughout because it is designed for paired classifier comparisons on the same dataset. It focuses on disagreement cases, ignoring samples where both models agree, which makes it more sensitive to genuine performance differences than standard accuracy comparisons on dependent samples.
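A sketch of the paired comparison, assuming `base_pred`, `curr_pred`, and `truth` are parallel label arrays over the benchmark (statsmodels provides the test itself):

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_p(base_pred, curr_pred, truth) -> float:
    """McNemar's test on the 2x2 table of correct/incorrect decisions by the two classifiers."""
    base_ok = np.asarray(base_pred) == np.asarray(truth)
    curr_ok = np.asarray(curr_pred) == np.asarray(truth)
    table = [
        [np.sum(base_ok & curr_ok),  np.sum(base_ok & ~curr_ok)],
        [np.sum(~base_ok & curr_ok), np.sum(~base_ok & ~curr_ok)],
    ]
    # Only the off-diagonal disagreement cells drive the statistic.
    return mcnemar(table, exact=False, correction=True).pvalue
```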
Curriculum Learning: What It Gets Right and Where It Fails
The controlled contextual tests told a different story. When asked to classify 220 targeted sentences designed around known patterns ("stupid [object]" vs. "stupid [person]"), curriculum-trained models achieved 93.64% average accuracy versus 78.36% for baseline models. Every model improved. Every improvement was statistically significant (p < 0.001).
| Architecture | Curriculum | Baseline | Improvement (%) |
|---|---|---|---|
| ELECTRA | 0.9500 | 0.7636 | +18.64 |
| DistilBERT | 0.9409 | 0.7682 | +17.27 |
| ALBERT | 0.9000 | 0.7455 | +15.45 |
| DeBERTa | 0.9455 | 0.8182 | +12.73 |
| RoBERTa | 0.9455 | 0.8227 | +12.27 |
| Average | 0.9364 | 0.7836 | +15.27 |
"Curriculum learning teaches the patterns cleanly. It does not teach the noise. Real-world data is mostly noise."
This is the core tension. Synthetic curriculum examples sit in a different feature space than natural user-generated text. The sentences are regular, the vocabulary is controlled, the intent is unambiguous. Real harmful content is none of those things. Models that train on clean patterns first learn decision boundaries that are too rigid for the messier distribution they encounter later.
Baseline training, by contrast, exposes models to label ambiguity and semantic overlap from the start. A message that is simultaneously hateful and racist, or toxic and sexist, appears in training. The model has to accommodate it. Curriculum training delays that accommodation, and the delay appears to cost generalization.
The problem is probably compounded by single-label annotation. When a sample could legitimately belong to two categories, the annotator picks one. Curriculum training, which begins with examples treated as definitively correct, may reinforce overconfidence in those assignments. Soft labels or multi-label formulations could help here.
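One multi-label formulation replaces the single softmax head with independent sigmoid outputs, so a message can legitimately score high on both hate and racist. A minimal sketch of how that could be wired with transformers; this illustrates the suggestion rather than an experiment reported here, and the label ordering is assumed:

```python
import torch
from transformers import AutoModelForSequenceClassification

# Five harm categories predicted independently instead of as a single forced choice.
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=5,
    problem_type="multi_label_classification",  # switches the loss to BCEWithLogitsLoss
)

# Targets become multi-hot vectors, e.g. a message that is both hateful and racist:
targets = torch.tensor([[0., 1., 1., 0., 0.]])  # [toxic, hate, racist, sexist, sexual]
```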
Practical Implications
Three findings stand out for anyone building or evaluating a production moderation system.
First, general-purpose LLMs are not the right tool for classification tasks. Mistral-7B is roughly 50 times larger than DeBERTa, the largest of the small models tested, yet it underperforms on both precision and recall while consuming 68-75 times more energy per 1,000 samples. The architectural mismatch is fundamental: autoregressive generation is the wrong mechanism for a classification decision.
Second, parameter count does not predict efficiency within the small-model range. DistilBERT at 67M parameters is faster and more energy-efficient than ALBERT at 11.7M. Distillation produces throughput benefits that compression alone does not. For latency-constrained deployments, DistilBERT is the clear choice.
Third, curriculum learning should be evaluated carefully before deployment. It improves controlled contextual accuracy substantially and could be valuable in datasets with known lexical biases. But on real-world classification, four of five architectures degraded significantly. The method is not ready to replace standard fine-tuning as a default.
Across all architectures and tasks, RoBERTa delivered the strongest overall results. It led in recall — the stated priority metric — for both binary and multi-harm classification, and was the most robust to curriculum-related degradation, with the smallest F1 drop (−2.37%) that was not statistically significant. For deployments where energy efficiency is the primary constraint, DistilBERT remains the best choice, processing 232 texts per second at 640–647 joules per 1,000 samples.
The most practical near-term use of LLMs in this pipeline is data labelling. LLM agreement with human annotators exceeded 95% on binary classification. That is fast, scalable, and reduces direct human exposure to harmful content. Labelling is where LLM capacity maps naturally onto the problem.
Limitations
Several constraints bound these results. The benchmark was drawn from the same distribution as the training data, which means performance on fully out-of-domain content is unknown. The curriculum learning dataset was generated programmatically; more diverse augmentation might close the gap with baseline training on real data. Single-label annotation forces arbitrary choices at category boundaries, injecting label noise that affects all models. Finally, energy results are hardware-specific and were measured on a single GPU configuration.
Conclusion
Fine-tuned small transformer models are viable for production content moderation. They outperform Mistral-7B on accuracy while consuming a tiny fraction of the compute. Within that group, DistilBERT offers the best energy profile, not because it is the smallest, but because distillation optimizes for inference in ways parameter reduction alone does not.
Curriculum learning remains an open question. It works well for controlled pattern recognition. It does not yet generalize to the complexity of real user-generated content, at least not with the synthetic sentence generation approach tested here. Future work should examine authentic data augmentation, ordered curriculum strategies that progress from clear cases to edge cases, and multi-label architectures that handle the overlap between harm categories more honestly.
The path to efficient, accurate moderation at scale does not run through larger models. It runs through better-trained small ones.
Future Directions
Future work should explore ordered curriculum strategies that progress from unambiguous examples to edge cases rather than relying on pattern-based staging. Multi-label classification architectures could predict probability distributions across all harm categories simultaneously, avoiding forced single-label assignments. Authentic data augmentation through back-translation and paraphrasing may help maintain distributional alignment between training and evaluation data. Cascade architectures that combine fast binary classification with slower fine-grained categorization could further optimise the accuracy-efficiency trade-off. Finally, parameter-efficient fine-tuning methods such as LoRA and prefix tuning warrant investigation for reducing training costs while preserving the performance observed with full fine-tuning.
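As an illustration of that last direction, a LoRA adapter could be attached to one of the encoder classifiers via the peft library; the rank, target modules, and checkpoint below are assumptions, not settings from this work:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

base = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Low-rank adapters on the attention projections; only these (and the head) are trained.
config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the full 124.6M parameters
```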