Karya: Quiet bet on fair AI work – GoodTechGo

Artificial intelligence systems depend heavily on data. For models to work reliably, especially in language and speech, they need large volumes of labeled and verified inputs. In many cases, this data is created through centralized teams or outsourced to global annotation firms.

Karya takes a different approach by distributing this work to rural communities and structuring it as a paid digital task system.

Origin

Karya was founded by Manu Chopra, a computer scientist and researcher with a background in AI and distributed systems. He has been associated with Microsoft Research India, where his work focused on using technology to solve problems in low-resource settings.

His work at Microsoft Research exposed a recurring gap in AI systems: most models were trained on data from high-income, English-speaking contexts, which limited their effectiveness in regions like India. At the same time, he observed that rural communities had the potential to contribute to data creation if the work was structured appropriately.

Karya was built around this intersection. It operates at the data layer of AI: rather than building end-user AI products, it creates high-quality datasets for training models, particularly for languages and contexts that are underrepresented in existing data.

The organization was founded with the idea that data work can be both a technical input and a source of income. In many parts of India, there is limited access to formal employment, but widespread access to mobile phones. This creates an opportunity to deliver structured digital work that can be completed remotely.

Product

At its core, Karya provides a platform where tasks are delivered to workers through mobile interfaces. These tasks are designed to be simple and clearly defined. They can include activities such as recording speech, transcribing audio, labeling images, or validating text. Each task contributes to building a dataset that is later used in machine learning models.

One of the key areas of focus for Karya is language data. Many AI systems perform well in widely used languages but struggle with regional languages that have less available data. By collecting speech and text data from native speakers, Karya helps create datasets that reflect real-world usage.

The process begins with defining the dataset requirements. This could involve specifying the type of data needed, such as conversational speech or domain-specific vocabulary. Tasks are then designed to collect or annotate this data in a structured way.

Workers access these tasks through a mobile application or interface. Instructions are provided in a way that is easy to understand, even for users with limited technical background. For example, a speech collection task might ask a user to read or repeat a set of phrases into their phone.

Once tasks are completed, the data is reviewed and validated. Quality control is an important part of the system because the usefulness of a dataset depends on its accuracy. Karya incorporates verification steps to ensure that the collected data meets required standards.
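One common way to implement a verification step like this is to have several workers complete the same task and accept an answer only when enough of them agree. The sketch below illustrates that idea; it is not a description of Karya's actual method. The function name `validate_by_majority` and the thresholds `min_votes` and `min_agreement` are assumptions chosen for illustration.

```python
from collections import Counter


def validate_by_majority(labels, min_votes=3, min_agreement=0.6):
    """Accept the most common label if enough workers agree.

    Returns (label, agreement) when the label is accepted, or
    (None, agreement) when the item should go to manual review.
    All thresholds are illustrative assumptions.
    """
    # Too few independent answers to judge: send to review.
    if len(labels) < min_votes:
        return None, 0.0

    # Find the most frequent answer and the share of workers who gave it.
    label, count = Counter(labels).most_common(1)[0]
    agreement = count / len(labels)

    if agreement >= min_agreement:
        return label, agreement
    return None, agreement
```

For example, three transcriptions of the same clip where two match would be accepted at the default thresholds, while three different transcriptions would be flagged for review.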

The platform is designed to operate at scale. By distributing tasks across a large number of workers, it can generate significant volumes of data in a relatively short time. This distributed approach also allows data to be collected from diverse linguistic and regional contexts.

One of the distinguishing aspects of Karya is its focus on fair compensation. Workers are paid for the tasks they complete, and the system is structured to provide meaningful income rather than micro-payments. This positions the platform not just as a data provider but also as a livelihood model.

In practice, Karya has been used to build datasets for applications such as speech recognition and language processing. These datasets are particularly valuable for Indian languages, where high-quality labeled data is limited.

The platform also addresses the challenge of accessibility. Tasks are designed to work on basic smartphones, and interfaces are adapted to local languages. This reduces barriers to participation and allows a wider range of users to contribute.

From a technical perspective, the system needs to manage task distribution, data collection, validation, and aggregation. Each of these steps is part of a pipeline that transforms individual contributions into a usable dataset.
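The four stages named above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not Karya's implementation: the `Submission` type, the round-robin assignment, the non-empty check used as a stand-in for validation, and the `collect` callback are all hypothetical.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Submission:
    task_id: str
    worker_id: str
    payload: str  # e.g. a transcript or a path to a recording


def distribute(task_ids, worker_ids, copies=2):
    """Assign each task to `copies` workers in round-robin order."""
    if copies > len(worker_ids):
        raise ValueError("need at least `copies` workers")
    workers = cycle(worker_ids)
    return [(t, next(workers)) for t in task_ids for _ in range(copies)]


def validate(submissions):
    """Stand-in check: keep only submissions with a non-empty payload."""
    return [s for s in submissions if s.payload.strip()]


def aggregate(submissions):
    """Group validated payloads by task to form the dataset."""
    dataset = {}
    for s in submissions:
        dataset.setdefault(s.task_id, []).append(s.payload)
    return dataset


def run_pipeline(task_ids, worker_ids, collect):
    """Distribute tasks, collect responses, validate, then aggregate.

    `collect(task_id, worker_id)` is a hypothetical callback that
    returns the worker's response for a task.
    """
    assignments = distribute(task_ids, worker_ids)
    submissions = [Submission(t, w, collect(t, w)) for t, w in assignments]
    return aggregate(validate(submissions))
```

Assigning each task to more than one worker is a design choice that gives the validation stage something to compare; a real system would replace the non-empty check with review or agreement-based checks.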

How is it different?

Karya’s positioning is different from traditional data annotation companies. Instead of relying on centralized teams, it builds a distributed workforce model. This has implications for both cost and diversity of data.

Globally, there is increasing demand for high-quality datasets as AI systems become more widespread. At the same time, there is growing awareness of the need for inclusive data that represents different languages and communities.

Karya fits into this context by focusing on regions that are often underrepresented in global datasets. By collecting data directly from these communities, it helps make the models trained on that data more relevant and accurate for them.

The approach also highlights a broader shift in how data is sourced. Instead of being treated as a byproduct, data is treated as a resource that can be generated through structured processes. This requires systems that can coordinate large numbers of contributors and maintain quality.

Deployment

In terms of deployment, Karya works with organizations that require datasets for training AI models. This includes companies building language technologies as well as research institutions.

The impact of the platform can be seen at two levels. At the technical level, it contributes to better-performing AI systems by providing relevant data. At the social level, it creates opportunities for income in areas where such opportunities are limited.

The effectiveness of the model depends on maintaining a balance between scale and quality. As the number of workers increases, ensuring consistent output becomes more challenging. This requires robust validation mechanisms and continuous monitoring.

Karya works at a layer of the AI ecosystem that is often overlooked but essential. Without high-quality data, even the most advanced models cannot perform effectively. By focusing on this layer, the organization addresses a fundamental requirement of AI development.

The model also suggests a way to align technological needs with economic opportunities. By linking data creation with distributed work, it creates a system where both the technology and the contributors benefit.

As AI systems expand into more languages and contexts, the need for such approaches is likely to grow. Platforms that can generate reliable, diverse datasets at scale will play an important role in this expansion.

Karya’s work sits at this intersection of data, technology, and livelihoods, focusing on how the raw material of AI is created and who participates in that process.

Our correspondent
