Speaker: Annie Lu, PhD student, supervised by Yun Sing Koh and Joerg Wicker
Abstract: Linguistic research has shown that regional varieties of a language, such as dialects, differ in their syntactic and semantic features. However, these dialects are underrepresented in the corpora conventionally used to train large language models. As a result, directly using word embeddings from pre-trained language models may introduce unwanted bias, while retraining may require large amounts of data and training time. This research seeks solutions to these domain gaps and the resulting unintentional bias.