Characterization of the Biomedical Lexicon in a Corpus About COVID-19 in the Portuguese Language
In the present study, we aimed to understand, from the point of view of specialty languages, the biomedical lexicon of a corpus of scientific texts on COVID-19 in Portuguese. A further objective was to provide a basis for identifying biomedical language patterns that can later be fed into artificial intelligence algorithms for the automatic detection of grammatical and lexical patterns, in order to help machines identify and differentiate biomedical domain content. In the first phase of the research, we studied a new indicator, Lex-BioMed (ASSIS et al., 2021), which proved capable of measuring the distribution of words with biomedical content across the medical specialties and text genres that make up the COVID-19 UFABC corpus (LEITE et al., 2020). During the development of Lex-BioMed, we identified a high frequency of noun phrases with accumulated adjectival modifiers in the corpus. Motivated by these results, we created a statistical model capable of measuring the contextual biomedicality of a term by analyzing its context. In the final stage of the work, we concluded that the context in which adjectives (non-biomedical ones, in this case) appear is an important factor in determining biomedicality: in our results, biomedicality was diffuse, spreading along the syntagmatic axis, so that these exceptional contexts are linked to dependency structures. For adjectives classified as Intrinsic in this study, we observed that linearity prevailed. This was expected, since biomedical language is considered a more objective language, with shorter sentences and a stronger commitment to linearity.