Theoretical foundations and construction of a specialized-language health-domain corpus on COVID-19 in Brazilian Portuguese
This work presents the theoretical foundations and methodology that guided the construction of the UFABC COVID-19 corpus. The corpus is composed of health-domain texts written in Portuguese in specialized language and published between March 2020 and September 2020. This period corresponds to the initial stage of the pandemic, when knowledge about the coronavirus was largely restricted to the specialized academic community, especially at the outset. In this context, the dissemination of information about the disease and about prevention protocols, such as social distancing, played a leading role in combating the pandemic. Such content used terms imported from specialized domains, including complex words that posed comprehension difficulties for a lay public unfamiliar with medical jargon. This work therefore presents a first effort to build resources that enable research on lexical simplification and complex word identification in the context of the fight against COVID-19, through the development of the proposed corpus. The methodology was based on the extraction, compilation, storage and categorization of texts obtained from the PubMed scientific database, which resulted in 254 texts. The categorization process showed that approximately 30% of the texts relate to the fields of Collective Health and Epidemiology, whereas few or no works were found in medical specialties of a more exploratory character, such as virology or genomics, reflecting the publication behavior of the scientific community in Portuguese during the pandemic.
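The prevalence figure reported for the categorization step is a per-category share of the compiled texts. The following is a minimal sketch of that calculation, not the authors' implementation: the record layout (a dictionary with a `category` key), the function name `category_shares`, and the toy records are all assumptions made for illustration and are not taken from the corpus.

```python
from collections import Counter


def category_shares(records):
    """Return {category: fraction of records} for dicts carrying a 'category' key.

    Hypothetical helper: the field name 'category' is an assumption.
    """
    counts = Counter(r["category"] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in counts.items()}


# Toy placeholder records (hypothetical, not drawn from the actual corpus).
toy_records = [
    {"id": "t1", "category": "Epidemiology"},
    {"id": "t2", "category": "Collective Health"},
    {"id": "t3", "category": "Cardiology"},
    {"id": "t4", "category": "Epidemiology"},
]

shares = category_shares(toy_records)

# List categories from most to least frequent.
for cat, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{cat}: {share:.0%}")
```

Applied to the real category labels, the same computation would show which fields dominate the corpus and which are absent or nearly so.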