Karya sits at the intersection of artificial intelligence and livelihoods.
In a sector often criticised for invisible labour and extractive economics, this Bengaluru-based organisation has tried to flip the model.
Instead of treating data workers as a disposable backend to AI systems, Karya builds its entire value proposition around them. The result is a company that looks less like a typical AI startup and more like an economic redesign layered on top of data infrastructure.
Origins
Karya’s story begins not as a startup pitch, but as a research experiment. The idea first took shape around 2017 within Microsoft Research India, where researchers were exploring how to build high-quality datasets in Indian languages while engaging local communities.
This early work revealed two structural gaps. First, India’s linguistic diversity was barely represented in global AI systems. Second, the process of creating training data—if designed differently—could become a meaningful source of income.
By 2021, the project was spun out into an independent organisation called Karya. It was founded by Manu Chopra and Vivek Seshadri, with Safiya Husain joining later in a key role. From the beginning, the ambition was not just to build datasets, but to rethink who benefits from the creation of AI.
The model
Karya describes itself as an ethical data company, a phrase that becomes clearer when you look at how it operates. At its core, it builds datasets for AI systems—speech recordings, text annotations, translations, and increasingly more complex evaluation tasks. The focus is on low-resource languages, which are widely spoken but poorly represented in digital systems.
Workers, often from rural or low-income communities, use a mobile platform to complete tasks in their native languages. Someone might record spoken Kannada sentences, transcribe audio, or label images relevant to agriculture or healthcare. These are the same kinds of tasks that power modern AI systems, but Karya changes the structure around them.
The most striking difference is compensation. Workers are paid significantly more than on typical microtask platforms, in some cases several times the local minimum wage. In a sector where labour is often undervalued, this is a deliberate design choice. There have also been experiments with sharing value, where contributors benefit when datasets are reused.
Another notable feature is how revenue flows. A large share of project revenue is passed directly to workers rather than being absorbed as margin. This changes the economics of the system. Instead of optimising for the lowest possible labour cost, the model is built around fair distribution.
The platform has already reached scale, engaging tens of thousands of workers and completing millions of tasks across languages and domains. Importantly, the work is distributed—there is no need for workers to relocate or join formal offices, which expands access significantly.
Growth, funding, and structure
Karya’s growth path stands apart from many AI startups. It has largely avoided traditional venture capital and instead built revenue through contracts with technology companies and research organisations. This has allowed it to maintain control over its model without the pressure to optimise purely for rapid scale.
By the mid-2020s, the organisation had grown into a meaningful operation, with a relatively lean team and strong revenue traction. Clients include global technology firms and institutions that require high-quality, diverse datasets.
There are also signs of a hybrid structure. While there is a commercial arm generating revenue, the broader mission remains development-focused. This dual approach allows Karya to operate in competitive markets while staying aligned with its original purpose.
What makes the product different
At a surface level, Karya provides data collection and annotation services. But the deeper innovation lies in how that data is sourced and structured.
One key aspect is linguistic depth. Instead of relying on standardised datasets, Karya captures real speech patterns, dialects, and code-switching across Indian languages. This matters because most AI systems struggle not with formal language, but with how people actually speak in everyday life.
Another layer is context. Karya has explored first-person datasets that reflect how individuals experience the world. This becomes important in sectors like healthcare or agriculture, where context shapes meaning. For example, how a farmer describes a crop disease may vary widely based on region, language, and lived experience.
There is also a strong focus on human evaluation. As AI systems grow more complex, evaluating their outputs—checking for bias, accuracy, and cultural sensitivity—becomes critical. Karya’s distributed workforce enables this kind of evaluation at scale, with inputs grounded in real communities.
Finally, the delivery model matters. By being mobile-first, Karya removes many barriers to participation. Workers do not need advanced infrastructure or formal training to get started, which broadens inclusion.
Pilots, performance, and market feedback
Karya’s datasets have been used across sectors such as agriculture, healthcare, and language technology. They support speech recognition systems, translation tools, and conversational AI models. For global companies, one of the hardest problems is ensuring that systems work across languages and cultures. Karya’s work helps bridge that gap.
Feedback from partners tends to centre on quality and diversity. The datasets are not just large; they are grounded in real-world usage, which improves performance in practical scenarios.
On the worker side, the impact is more direct. Participants often report earning more than they would in traditional local jobs. For many, this is not just supplementary income but a meaningful shift in earning potential.
At the same time, challenges remain. Work availability depends on client demand, which can fluctuate. Scaling the model requires a steady pipeline of projects and careful management of workforce expectations. There is also the broader question of whether such high compensation levels can be sustained as more players enter the space.
Comparable models in India and beyond
Karya operates in a space that overlaps with both traditional data-labelling firms and newer AI infrastructure companies. Platforms like Scale AI and Amazon Mechanical Turk offer similar services at scale, but their labour models are very different, often prioritising efficiency over worker earnings.
In India, companies such as Fractal Analytics operate further up the value chain, focusing on enterprise AI solutions rather than data creation itself. Meanwhile, newer players like Sarvam AI are building language models directly, highlighting how the ecosystem is expanding in multiple directions.
There is also a growing global movement around ethical AI and data cooperatives. These initiatives aim to give contributors more control and a greater share of value. Karya is one of the more visible examples of this approach, particularly in emerging markets.
A global lens
The rise of generative AI has dramatically increased the demand for high-quality data. Language models, speech systems, and computer vision tools all depend on large, diverse datasets. At the same time, there is increasing scrutiny of how this data is created.
Questions around bias, representation, and labour conditions are becoming central to discussions about AI. Models trained on narrow datasets produce narrow outcomes. Systems built on poorly compensated labour raise ethical concerns.
This is where organisations like Karya fit into a broader shift. They treat data not just as a technical input, but as an economic layer that can be designed differently. The idea is simple but powerful: if data is valuable, the people who create it should share in that value.
Across regions such as Africa, Southeast Asia, and Latin America, similar approaches are beginning to emerge. Many focus on local languages and communities, recognising that inclusion in AI starts with inclusion in data.
- Our correspondent
