AI Singapore and Google partner to enhance Southeast Asian Large Language Model training datasets


AI Singapore and Google are working together to improve artificial intelligence that understands languages spoken in Southeast Asia. 

This falls under what they call Project SEALD, which stands for Southeast Asian Languages in One Network Data. I know, that should have been called SEALOND. 

In any case, here's a summary of what they are doing:

  • Building a big library of text data in Southeast Asian languages, starting with Indonesian, Thai, and Filipino.
  • Creating special tools to make AI better understand the unique ways people speak in Southeast Asia.
  • Making this data and these tools available for everyone to use.

In more formal terms, under Project SEALD, AI Singapore (AISG) and Google Research Asia Pacific (APAC) will work together on:

  • Developing translocalization and translation models, 
  • Establishing best practices for instruction tuning datasets, 
  • Creating tools to enable translocalization at scale, and 
  • Publishing pre-training recipes for SEA languages.

Advancing SEA LLMs for the region

Building on this, AISG is collaborating with Google Cloud to make its SEA-LION LLMs available in Model Garden on Vertex AI. Model Garden gives organizations access to first-party, third-party, and open models that meet Google Cloud’s strict enterprise safety and quality standards. Through Vertex AI, organizations can use enterprise-grade tools to customize these models for relevant use cases and integrate them into their applications. In addition, AISG will continue to make its SEA-LION LLMs available on Hugging Face, which has been partnering with Google Cloud to help developers train, tune, and serve open models quickly and cost-effectively.

AISG has also initiated collaborations across Singapore and other SEA countries. For example, AISG has signed Memorandums of Understanding (MOUs) or Letters of Intent (LOIs) with Indonesian, Malaysian, and Vietnamese entities for the development of datasets and applications for regional LLMs. In addition, AISG has been engaging partners in Thailand, the Philippines, and Indonesia to build resources on regional language syntax and semantics. Finally, in the Singapore context, AISG works closely with public sector and R&D stakeholders on safety alignment and multimodality.

Elsewhere in APAC, Google Research has a similar large-scale language inclusivity project under way in India with the Indian Institute of Science: Project Vaani, an initiative that is gathering, transcribing, and open-sourcing speech data from across all of India’s 773 districts.