Data Curation Intern

Karya

Data Science

Bengaluru, Karnataka, India

Posted on May 11, 2026

About Karya:

Why was Karya on the cover of Time magazine, highlighted by Satya Nadella, and invited to present its work to Sundar Pichai one-on-one?
In part, because Karya is on a mission to provide AI-enabled earning and learning opportunities to communities with high talent but limited access to opportunities. Karya achieves this while also delivering high-quality, timely, and price-competitive data to its clients.
Karya builds high-quality datasets for large companies like Google and Microsoft, while providing ethical work opportunities and fair wages to its workforce.
Karya’s workers make nearly 20 times the Indian minimum wage. Through our one-of-a-kind digital work platform, we have delivered over 40 million digital tasks and have positively impacted over 100,000 workers. In the coming years, our goal is to rapidly scale our impact by bringing economic opportunities to millions of underserved users in India. With a rapidly growing global presence, we are also looking to expand our client base in the Indian market by partnering with leading Indian enterprises.

About the Role

We are looking for a detail-oriented and curious Data Curation Intern to help build high-quality datasets for training AI/ML models, with a specific focus on Indian-language and multilingual data. You will work with large open-source datasets (e.g., Sangraha by AI4Bharat) that require significant cleaning, structuring, and enrichment before they can be used effectively in model training pipelines.

This is a hands-on, high-impact role at the intersection of data engineering, linguistics, and AI. You will start with text data pipelines and progressively move toward preparing data for read-speech and voice model training.


What You'll Do

Phase 1: Text Data Curation
Audit and profile open-source datasets (Sangraha, Common Crawl, IndicCorp, etc.) to assess quality, coverage, and noise levels
Design and implement data cleaning pipelines: deduplication, script normalisation, encoding fixes, noise removal, sentence boundary detection
Create and apply metadata tagging schemas that label text by domain (news, legal, literature, health, etc.), subdomain, language, register, and quality tier
Build validation checklists and quality scorecards to benchmark dataset readiness for model training
Document data provenance, licensing, and processing steps for reproducibility

Phase 2: Speech & Voice Data Preparation
Curate high-quality, phonetically diverse text passages suitable for read-speech recording
Ensure text selection covers the domain, prosodic, and phonemic variety required for TTS/ASR model training
Assist in defining metadata standards for audio datasets (speaker demographics, recording conditions, transcription format)
Support the pipeline transition from text corpus to aligned speech dataset


What We're Looking For

Must Have
Strong attention to detail — you notice inconsistencies others miss
Comfort with Python for data processing (pandas, regex, basic NLP libraries like spaCy or NLTK)
Familiarity with text data formats: CSV, JSONL, Parquet, plain text corpora
Curiosity about AI/ML, language technology, or computational linguistics
Ability to work independently, document work clearly, and communicate blockers early

Good to Have
Prior exposure to NLP datasets or open-source language resources (IndicNLP, AI4Bharat, Hugging Face datasets)
Knowledge of one or more Indian languages in addition to English
Experience with data versioning tools (DVC, Git-LFS) or dataset platforms (Hugging Face Hub)
Basic understanding of how language models or speech models are trained


Why This Role

Work directly on real data pipelines that feed AI model training — not toy projects
Gain hands-on experience with large-scale multilingual and Indic language datasets
Build skills that are in high demand across AI labs, speech companies, and NLP startups
Clear progression path: text → read speech → voice data, with increasing responsibility
Mentorship from people who have built data and AI systems at scale


Karya celebrates diversity and is an equal opportunity employer. All applicants will be considered without regard to race, religion, gender identity, sexual orientation, disability, or any other protected status.