Researchers from UO CIS, including third-year Ph.D. student Amir Pouran Ben Veyseh and Prof. Thien Huu Nguyen of the Natural Language Processing (NLP) group, recently won a Best Paper Award at the 16th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2021 (demonstration track), for MadDog, their new web-based system for acronym identification and disambiguation in text. In addition, first-year Ph.D. student Minh Van Nguyen, third-year Ph.D. students Viet Dac Lai and Amir Pouran Ben Veyseh, and Prof. Thien Huu Nguyen won an Outstanding Paper Award at EACL 2021 (demonstration track) for their new toolkit, Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing.
MadDog: A Web-based System for Acronym Identification and Disambiguation [Best Demo Paper Award]
Acronyms and abbreviations are short forms of longer phrases that are used ubiquitously in many types of writing. Although they save space for the writer and time for the reader, they also make text harder to understand, especially when an acronym is never defined or is used far from its definition in a long text. To alleviate this issue, both the research community and software developers have made considerable efforts to build systems that identify acronyms and find their correct meanings in text. However, none of the existing works provides a unified, publicly available solution capable of processing acronyms in various domains. In their newly developed system MadDog (with the Best Paper Award), Amir, Thien, and their collaborators at Adobe Research presented the first web-based acronym identification and disambiguation system that can process acronyms from various domains, including the scientific, biomedical, and general domains. They also introduced the largest available dataset for acronym disambiguation, with more than 46 million samples. The web-based system is publicly available here, and a demo video is available here. The system source code is also released here. Detailed information about the system can be found in their paper:
Amir Pouran Ben Veyseh, Franck Dernoncourt, Walter Chang, and Thien Huu Nguyen. MadDog: A Web-based System for Acronym Identification and Disambiguation. In Proceedings of EACL 2021 (Demonstration Track).
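To make the two tasks concrete, the following is a minimal sketch of acronym identification in Python. It is not MadDog's method, which is described in the paper above; it is a simple illustrative heuristic that assumes an acronym is defined in the form "long form (ACR)" and that each letter of the acronym matches the first letter of one preceding word. The function name find_acronym_definitions is hypothetical.

```python
import re


def find_acronym_definitions(text):
    """Return {acronym: long_form} for patterns like 'long form (ACR)'.

    Illustrative heuristic only (an assumption, not MadDog's approach):
    the long form is taken to be the N words immediately before the
    parenthesized acronym, where N is the number of letters in the acronym,
    and each word must start with the corresponding acronym letter.
    """
    pairs = {}
    for match in re.finditer(r"\(([A-Z]{2,10})\)", text):
        acronym = match.group(1)
        # Words that appear before the opening parenthesis.
        words = text[:match.start()].split()
        candidate = words[-len(acronym):]
        # Accept only if there are enough words and all initials line up.
        if len(candidate) == len(acronym) and all(
            word[0].upper() == letter
            for word, letter in zip(candidate, acronym)
        ):
            pairs[acronym] = " ".join(candidate)
    return pairs


if __name__ == "__main__":
    sample = (
        "We study natural language processing (NLP) and "
        "named entity recognition (NER)."
    )
    print(find_acronym_definitions(sample))
```

A heuristic like this only covers acronyms that are defined locally. Acronyms that are never defined in the text, or that have several possible meanings, require disambiguation against a dictionary of known long forms, which is the harder problem the article describes.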
Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing [Outstanding Demo Paper Award]
In their Trankit toolkit (with the Outstanding Paper Award), Minh, Viet, Amir, and Thien introduced a Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing. Many efforts have been devoted to developing multilingual NLP systems to overcome language barriers. A large portion of existing multilingual systems has focused on downstream NLP tasks that critically depend on upstream linguistic features. These features range from basic information, such as token and sentence boundaries in raw text, to more sophisticated structures, such as part-of-speech tags, morphological features, and dependency trees of sentences; the tasks that produce them are called fundamental NLP tasks. As such, building effective multilingual systems and pipelines for these fundamental upstream tasks has the potential to transform multilingual downstream systems. Several existing NLP toolkits address multilingualism for fundamental NLP tasks; however, these toolkits come with their own limitations, including poor performance, the inability to process raw text, large memory footprints, and the failure to exploit contextualized embeddings from pretrained transformer-based language models to deliver state-of-the-art performance.
In Trankit, UO researchers introduced a multilingual Transformer-based NLP toolkit that overcomes these limitations. The toolkit can process raw text for fundamental NLP tasks, supporting 56 languages with 90 pretrained pipelines on 90 treebanks of Universal Dependencies. By utilizing state-of-the-art multilingual pretrained transformers, Trankit advances state-of-the-art performance for sentence segmentation, part-of-speech (POS) tagging, morphological feature tagging, and dependency parsing, while achieving competitive or better performance for tokenization, multi-word token expansion, and lemmatization over the 90 treebanks. It also obtains competitive or better performance for named entity recognition (NER) on 11 public datasets. Trankit is expected to significantly boost research and development in multilingual NLP. The toolkit, along with its pretrained models and code, is publicly available here. A demo website for Trankit is also available here, and a demo video is available here. For more information about Trankit, please refer to the technical paper:
Minh Van Nguyen, Viet Dac Lai, Amir Pouran Ben Veyseh and Thien Huu Nguyen. Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing. In Proceedings of EACL 2021 (Demonstration Track).
The Association for Computational Linguistics (ACL) is the international scientific and professional society for people working on problems involving natural language and computation. Founded in 1962 as the Association for Machine Translation and Computational Linguistics (AMTCL), it became the ACL in 1968. Its annual meeting is held each summer in locations where significant computational linguistics research is carried out. The European Chapter of the ACL (EACL) is the primary professional association for computational linguistics in Europe. It provides a number of services to its members and the community, including the presentation of cutting-edge research and support for educational initiatives in the field.
This work has been supported by Adobe Research Gifts, the Army Research Office, and the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), under the Better Extraction from Text Towards Enhanced Retrieval (BETTER) Program.