SEA-LION: Representing the Diverse Languages of Southeast Asia with LLMs
OVERVIEW
EXPERIENCE | In Person |
---|---|
TYPE | Breakout |
TRACK | Generative AI |
INDUSTRY | Enterprise Technology, Public Sector |
TECHNOLOGIES | AI/Machine Learning, Apache Spark, GenAI/LLMs |
SKILL LEVEL | Beginner |
DURATION | 40 min |
Southeast Asia is one of the world's most culturally diverse regions, spanning countries such as Singapore, Vietnam, Thailand, and Indonesia. People there speak multiple languages and draw cultural influences from China, India, and the West. To reflect these cultural contexts and linguistic influences, AI Singapore, a Singapore government entity, worked with Databricks MosaicML to build SEA-LION, an open-source large language model trained on local languages such as Thai, Indonesian, and Tamil. This localized LLM is suited to more than low-resource languages: it can also handle region-specific contexts, such as code-switching between multiple dialects within a single sentence. This session covers the design considerations behind SEA-LION, from customizing a tokenizer for regional languages to making the model cost-effective enough to appeal to resource-constrained organizations in the region. It also covers potential applications of the model and its long-term vision.
SESSION SPEAKERS
Jeanne Choo
APJ ML Practice Lead
Databricks
Ngee Chia Tai
AI Engineer
AI Singapore