Outlier
I took part in a reinforcement learning from human feedback (RLHF) effort to teach a general-purpose LLM to solve mathematics problems posed in Swedish. My work covered the full RLHF loop: (1) authoring and curating roughly 10,000 prompt/solution pairs spanning basic arithmetic through university-level calculus and linear algebra; (2) rating model answers for logical correctness, step-by-step reasoning, clarity, and linguistic fluency; and (3) writing high-quality reference solutions that were used for supervised fine-tuning (SFT) and to train the reward model.