R&D for Large Language Model Training Platform: Helping the TAIDE Project Create an AI Dialogue Engine Exclusively for Taiwan
CONTRIBUTING TEAM: Yi-Hao Hsiao, Meng-Chi Huang, Shun-Tai Wang
ChatGPT launched with a splash near the end of 2022 and quickly drew worldwide attention. The generative AI technology behind it has likewise attracted interest from industry, government, academia, and research organizations. To build Taiwan's own AI dialogue engine, the National Science and Technology Council (NSTC) launched the TAIDE (Trustworthy AI Dialogue Engine) project. Within the project, NCHC is responsible for building a new generation of supercomputers and developing a large language model training platform, helping to create an AI dialogue engine suited to Taiwan's characteristics.
Large language model training developer interface: multi-language model group chat
Built on Taiwanese culture, TAIDE integrates Taiwan's unique language, values, customs, and other elements to create an AI dialogue engine that understands and responds to the needs of local users. The project is also gradually bringing in trusted Traditional Chinese texts from different subject areas and establishing application examples in specific fields to improve the model's performance in each area. Simply put, TAIDE is a platform designed specifically for Taiwan: government departments and enterprises can choose the model size and computing power that fit their needs and train the models themselves, building AI base models that suit their internal applications.
In the TAIDE project, NCHC is responsible for computing power and applications. On the computing side, we provide the computing resources required for model training, accelerate that training, and build the necessary hardware. Initially, Taiwania 2 was used for model training and inference. In November 2023, the installation of nine NVIDIA DGX H100 systems was completed, adding 4.8 petaflops of computing power to accelerate the training of larger models and more complex AI applications.
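As a rough sanity check on the 4.8 petaflops figure, the arithmetic can be sketched as follows. The article does not say which precision the figure refers to; the per-system GPU count and the per-GPU peak below are assumptions drawn from NVIDIA's published H100 specifications (8 GPUs per DGX H100, about 67 TFLOPS peak FP64 Tensor Core throughput per GPU), not numbers from NCHC.

```python
# Back-of-the-envelope check of the added computing power.
# Only DGX_SYSTEMS comes from the article; the other two values are assumptions.
DGX_SYSTEMS = 9              # stated in the article
GPUS_PER_DGX = 8             # assumption: one DGX H100 holds 8 H100 GPUs
PEAK_TFLOPS_PER_GPU = 67.0   # assumption: peak FP64 Tensor Core TFLOPS per H100 SXM


def added_petaflops(systems: int, gpus_per_system: int, tflops_per_gpu: float) -> float:
    """Return the aggregate theoretical peak in petaflops."""
    total_tflops = systems * gpus_per_system * tflops_per_gpu
    return total_tflops / 1000.0  # 1 petaflops = 1000 teraflops


total_gpus = DGX_SYSTEMS * GPUS_PER_DGX
added_pf = added_petaflops(DGX_SYSTEMS, GPUS_PER_DGX, PEAK_TFLOPS_PER_GPU)
print(f"{total_gpus} GPUs, about {added_pf:.1f} petaflops peak")
```

Under these assumptions the total is 72 GPUs and roughly 4.8 petaflops, which is consistent with the stated figure. This is a theoretical peak, and sustained training throughput would differ.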
On the application side, we develop TAIDE's demonstration interface and an API for testing inference computing power and for model development. The interface offers many different models for testing and comparison, and it will be used in business collaboration with other companies and service providers.
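The idea of testing and comparing several models side by side can be illustrated with a minimal sketch. It does not reflect the actual TAIDE interface or NCHC's API. The function `compare_models` and the two stand-in models are hypothetical, and a real deployment would replace the stand-ins with calls to hosted models.

```python
from typing import Callable, Dict

# A "model" here is any function that maps a prompt to a reply (hypothetical type).
ModelFn = Callable[[str], str]


def compare_models(prompt: str, models: Dict[str, ModelFn]) -> Dict[str, str]:
    """Send the same prompt to every model and collect the replies by model name."""
    replies: Dict[str, str] = {}
    for name, generate in models.items():
        try:
            replies[name] = generate(prompt)
        except Exception as exc:  # keep comparing even if one model fails
            replies[name] = f"[error: {exc}]"
    return replies


# Stand-in models for illustration only; they are not real language models.
def echo_model(prompt: str) -> str:
    return f"echo: {prompt}"


def upper_model(prompt: str) -> str:
    return prompt.upper()


results = compare_models(
    "Hello, Taiwan",
    {"model-a": echo_model, "model-b": upper_model},
)
for name, reply in results.items():
    print(f"{name}: {reply}")
```

Keeping each model behind the same prompt-in, reply-out signature is what lets the replies be lined up for comparison, regardless of how each model is served.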
NCHC's large language model AI training platform has phased goals for each stage of development. Basic construction of the platform is complete, and it is now in the testing stage while the team continues to optimize and fine-tune it. Optimization focuses on data cleaning, improving the reliability and accuracy of the model, and ensuring compliance with ethical and regulatory requirements.
At the end of 2023, NCHC launched the first version of its AI training platform for large language models and released the API. The platform is not initially open to general users. Instead, it is mainly intended for specific industry, government, academic, and research organizations, especially banks, the Ministry of Justice, and other institutions that handle confidential information but cannot train models independently. Such organizations can integrate their existing systems through the API developed by the NCHC team, under security controls, to fine-tune models and process data. NCHC also provides dedicated confidential-data processing mechanisms as required by the laws governing these organizations, helping users reduce data security risks and simplify their operational processes.
NCHC's large language model training and development program covers Taiwanese and indigenous languages in addition to Chinese. The indigenous languages are included as a result of a program of the humanities department at NSTC that deliberately aims to preserve the languages of various ethnic groups. Many other domestic organizations are also working in this area, but the lack of a unified platform has left the data scattered. NCHC will put its AI models and data storage equipment to good use to integrate most of the ethnic language data in Taiwan. It will also invite ethnic language experts to help expand the corpus and transform it into a database suitable for AI. The database will cover speech as well as text, and the speech data will also be used for AI training. The resulting tools will support two-way presentation between Chinese and ethnic languages, covering text translation, speech recognition, and speech synthesis.
AI automatically generates tales translated from Chinese to the Taroko language.
These AI-trained tools can be used in education, for example as AI teaching assistants and storytellers that help tribal schoolchildren learn their tribal language in more engaging ways. The project will start with the Taroko tribal language.
Taroko text-to-speech AI model
Looking to the future, generative AI will become the core technology of the next generation of intelligence. NCHC will assist TAIDE in strengthening software and hardware performance, building Taiwan's own AI dialogue engine, and meeting the needs of the industry, academic, and research communities. NCHC will also continue to draw on its high-performance computing infrastructure and its team's expertise to develop the large language model AI training platform, expand local language applications, and play a role in preserving and passing on Taiwanese culture.