ADVERTISEMENT

AI Singapore and Google partner to enhance Southeast Asian Large Language Model training datasets

Published Apr 17, 2024 07:56 pm

AI Singapore and Google are working together to improve artificial intelligence that understands languages spoken in Southeast Asia. 

This is under what they call as Project SEALD -- which means Southeast Asian Languages in One Network Data. I know, that should have been called SEALOND. 

In any case, here's a summary of what they are doing:

  • Building a big library of text data in languages, starting with Indonesian, Thai, and Filipino.
  • Creating special tools to make AI better understand the unique ways people speak in Southeast Asia.
  • Making this data and these tools available for everyone to use.

Under Project SEALD, AISG and Google Research Asia Pacific (APAC) will work together on:

  • Developing translocalization and translation models, 
  • Establishing best practices for instruction tuning datasets, 
  • Creating tools to enable translocalization at scale, and 
  • Publishing pre-training recipes for SEA languages.

Advancing SEA LLMs for the region
Building on this, AISG is collaborating with Google Cloud to make its SEA-LION LLMs available on Google Cloud’s Model Garden on Vertex AI, which provides organizations with access to first-party, third-party, and open models that meet Google Cloud’s strict enterprise safety and quality standards. Through Vertex AI, organizations can use enterprise-grade tools to easily customize these models to address relevant use cases and integrate them into their applications. In addition, AISG will continue to make its SEA-LION LLMs available on Hugging Face, which has been partnering with Google Cloud to help developers train, tune, and serve open models quickly and cost-effectively.

AISG has also initiated collaborations across Singapore and other SEA countries. For example, AISG has signed Memorandums of Understanding (MOUs) or Letters of Intent (LOIs) with Indonesian, Malaysian, and Vietnamese entities for the development of datasets and applications for regional LLMs. In addition, AISG has been engaging partners in Thailand, the Philippines, and Indonesia to build resources on regional language syntax and semantics. Finally, in the Singapore context, AISG works closely with public sector and R&D stakeholders on safety alignment and multimodality.

In APAC, Google Research has a similar large-scale language inclusivity project ongoing in India with the Indian Institute of Science via Project Vaani—an initiative that is gathering, transcribing, and open-sourcing speech data from across all of India’s 773 districts. 

 

ADVERTISEMENT
.most-popular .layout-ratio{ padding-bottom: 79.13%; } @media (min-width: 768px) and (max-width: 1024px) { .widget-title { font-size: 15px !important; } }

{{ articles_filter_1561_widget.title }}

.most-popular .layout-ratio{ padding-bottom: 79.13%; } @media (min-width: 768px) and (max-width: 1024px) { .widget-title { font-size: 15px !important; } }

{{ articles_filter_1562_widget.title }}

.most-popular .layout-ratio{ padding-bottom: 79.13%; } @media (min-width: 768px) and (max-width: 1024px) { .widget-title { font-size: 15px !important; } }

{{ articles_filter_1563_widget.title }}

{{ articles_filter_1564_widget.title }}

.mb-article-details { position: relative; } .mb-article-details .article-body-preview, .mb-article-details .article-body-summary{ font-size: 17px; line-height: 30px; font-family: "Libre Caslon Text", serif; color: #000; } .mb-article-details .article-body-preview iframe , .mb-article-details .article-body-summary iframe{ width: 100%; margin: auto; } .read-more-background { background: linear-gradient(180deg, color(display-p3 1.000 1.000 1.000 / 0) 13.75%, color(display-p3 1.000 1.000 1.000 / 0.8) 30.79%, color(display-p3 1.000 1.000 1.000) 72.5%); position: absolute; height: 200px; width: 100%; bottom: 0; display: flex; justify-content: center; align-items: center; padding: 0; } .read-more-background a{ color: #000; } .read-more-btn { padding: 17px 45px; font-family: Inter; font-weight: 700; font-size: 18px; line-height: 16px; text-align: center; vertical-align: middle; border: 1px solid black; background-color: white; } .hidden { display: none; }
function initializeAllSwipers() { // Get all hidden inputs with cms_article_id document.querySelectorAll('[id^="cms_article_id_"]').forEach(function (input) { const cmsArticleId = input.value; const articleSelector = '#article-' + cmsArticleId + ' .body_images'; const swiperElement = document.querySelector(articleSelector); if (swiperElement && !swiperElement.classList.contains('swiper-initialized')) { new Swiper(articleSelector, { loop: true, pagination: false, navigation: { nextEl: '#article-' + cmsArticleId + ' .swiper-button-next', prevEl: '#article-' + cmsArticleId + ' .swiper-button-prev', }, }); } }); } setTimeout(initializeAllSwipers, 3000); const intersectionObserver = new IntersectionObserver( (entries) => { entries.forEach((entry) => { if (entry.isIntersecting) { const newUrl = entry.target.getAttribute("data-url"); if (newUrl) { history.pushState(null, null, newUrl); let article = entry.target; // Extract metadata const author = article.querySelector('.author-section').textContent.replace('By', '').trim(); const section = article.querySelector('.section-info ').textContent.replace(' ', ' '); const title = article.querySelector('.article-title h1').textContent; // Parse URL for Chartbeat path format const parsedUrl = new URL(newUrl, window.location.origin); const cleanUrl = parsedUrl.host + parsedUrl.pathname; // Update Chartbeat configuration if (typeof window._sf_async_config !== 'undefined') { window._sf_async_config.path = cleanUrl; window._sf_async_config.sections = section; window._sf_async_config.authors = author; } // Track virtual page view with Chartbeat if (typeof pSUPERFLY !== 'undefined' && typeof pSUPERFLY.virtualPage === 'function') { try { pSUPERFLY.virtualPage({ path: cleanUrl, title: title, sections: section, authors: author }); } catch (error) { console.error('ping error', error); } } // Optional: Update document title if (title && title !== document.title) { document.title = title; } } } }); }, { threshold: 0.1 } ); function showArticleBody(button) { const article = button.closest("article"); const summary = article.querySelector(".article-body-summary"); const body = article.querySelector(".article-body-preview"); const readMoreSection = article.querySelector(".read-more-background"); // Hide summary and read-more section summary.style.display = "none"; readMoreSection.style.display = "none"; // Show the full article body body.classList.remove("hidden"); } document.addEventListener("DOMContentLoaded", () => { let loadCount = 0; // Track how many times articles are loaded const offset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]; // Offset values const currentUrl = window.location.pathname.substring(1); let isLoading = false; // Prevent multiple calls if (!currentUrl) { console.log("Current URL is invalid."); return; } const sentinel = document.getElementById("load-more-sentinel"); if (!sentinel) { console.log("Sentinel element not found."); return; } function isSentinelVisible() { const rect = sentinel.getBoundingClientRect(); return ( rect.top < window.innerHeight && rect.bottom >= 0 ); } function onScroll() { if (isLoading) return; if (isSentinelVisible()) { if (loadCount >= offset.length) { console.log("Maximum load attempts reached."); window.removeEventListener("scroll", onScroll); return; } isLoading = true; const currentOffset = offset[loadCount]; window.loadMoreItems().then(() => { let article = document.querySelector('#widget_1690 > div:nth-last-of-type(2) article'); intersectionObserver.observe(article) loadCount++; }).catch(error => { console.error("Error loading more items:", error); }).finally(() => { isLoading = false; }); } } window.addEventListener("scroll", onScroll); });

Sign up by email to receive news.