You'll have direct influence on the models we ship.</p><p style="min-height:1.5em"></p><h2><strong>Your Responsibilities</strong></h2><ul style="min-height:1.5em"><li><p style="min-height:1.5em"><strong>Co-own data pipelines end-to-end:</strong> Design, build, and maintain the infrastructure that sources, processes, deduplicates, filters, and prepares pre-training corpora at scale. Monitor pipeline health and data quality metrics.</p></li><li><p style="min-height:1.5em"><strong>Close data gaps:</strong> Work with evaluation and post-training teams to identify where model weaknesses trace back to data coverage, then source or generate the data needed to address them.</p></li><li><p style="min-height:1.5em"><strong>Collaborate with post-training:</strong> Partner closely with the post-training team to ensure pre-training data decisions support downstream fine-tuning, alignment, and deployment goals - data choices upstream shape what's possible downstream.</p></li><li><p style="min-height:1.5em"><strong>Co-own German-language data:</strong> Ensure deep, high-quality coverage of German-language corpora - this is core to our value proposition, not an afterthought.</p></li><li><p style="min-height:1.5em"><strong>Establish data-to-performance signal:</strong> Design and run ablation studies to validate data choices - measuring how changes in composition, filtering, or sourcing affect pre-training evaluation metrics and downstream capabilities.</p></li><li><p style="min-height:1.5em"><strong>Take data transparency seriously:</strong> Maintain data lineage and provenance so the team knows exactly what went into each training run.</p></li></ul><h2><strong>Your Profile</strong></h2><h2><strong>Basic Qualifications</strong></h2><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Track record of shipping impactful technical work - whether that's research, infrastructure, or both.</p></li><li><p style="min-height:1.5em">Strong Python skills and comfort with 
data engineering and ML infrastructure, including experience with deep learning frameworks, workflow orchestration, object storage, columnar data formats, and distributed processing.</p></li><li><p style="min-height:1.5em">Ability to reason about what a dataset contributes to model training and whether it matters - not just process data, but understand it.</p></li><li><p style="min-height:1.5em">Ownership mentality: you see problems through from diagnosis to solution to deployment.</p></li><li><p style="min-height:1.5em">Willingness to relocate to Heidelberg or travel at least fortnightly.</p></li></ul><h2><strong>Preferred Qualifications</strong></h2><ul style="min-height:1.5em"><li><p style="min-height:1.5em">Experience with large-scale data processing for ML, including corpus sourcing, curation, cleaning, deduplication, and filtering.</p></li><li><p style="min-height:1.5em">Familiarity with data quality methods: classifier-based filtering, heuristic scoring, perplexity-based selection, and decontamination.</p></li><li><p style="min-height:1.5em">Understanding of foundation model training - how data composition, scale, and mixing ratios affect capabilities.</p></li><li><p style="min-height:1.5em">Experience with web-scale data sourcing and crawl processing (e.g., Common Crawl, WARC pipelines).</p></li><li><p style="min-height:1.5em">Rust proficiency (parts of our data pipeline are performance-critical).</p></li><li><p style="min-height:1.5em">Infrastructure knowledge - experience with Kubernetes, container orchestration, or cloud-native ML infrastructure.</p></li><li><p style="min-height:1.5em">PhD in machine learning, NLP, data engineering, or a related field (valued but not required - we care about what you can do).</p></li><li><p style="min-height:1.5em">Bonus, but not required: German language proficiency can be helpful for curating and assessing German-language data.</p></li></ul><h2>Compensation and Benefits</h2><ul style="min-height:1.5em"><li><p 
style="min-height:1.5em">Become part of an AI revolution!</p></li><li><p style="min-height:1.5em">30 days of paid vacation</p></li><li><p style="min-height:1.5em">Access to a variety of fitness &amp; wellness offerings via <a target="_blank" rel="noopener noreferrer" class="theme markdown__link" href="https://wellhub.com/de-de/">Wellhub</a></p></li><li><p style="min-height:1.5em">Mental health support through <a target="_blank" rel="noopener noreferrer nofollow" href="http://nilo.health">nilo.health</a></p></li><li><p style="min-height:1.5em">Substantially subsidized company pension plan for your future security</p></li><li><p style="min-height:1.5em">Subsidized Germany-wide transportation ticket</p></li><li><p style="min-height:1.5em">Budget for additional technical equipment</p></li><li><p style="min-height:1.5em">Flexible working hours for better work-life balance and hybrid working model</p></li><li><p style="min-height:1.5em">Virtual Stock Option Plan</p></li><li><p style="min-height:1.5em"><a target="_blank" rel="noopener noreferrer" class="theme markdown__link" href="https://www.jobrad.org/">JobRad®</a> Bike Lease</p></li></ul><div style="text-align:left"><img style="max-width:100%" src="https://app.ashbyhq.com/api/images/user-content/aa99046a-07f4-421d-8d6a-afe18d87eb09/a6a3f816-65cc-4590-aa69-685dcbbeac0b/image.png" /></div>