PPGCCM GRADUATE PROGRAM IN COMPUTER SCIENCE FUNDAÇÃO UNIVERSIDADE FEDERAL DO ABC Phone: 11 4996-8337 http://propg.ufabc.edu.br/ppgccm

DEFENSE Committee: CRISTIANO OLIVEIRA GONÇALVES

A MASTER'S DEFENSE committee has been registered by the program.
STUDENT: CRISTIANO OLIVEIRA GONÇALVES
DATE: 11/04/2022
TIME: 19:00
LOCATION: Google Meet
TITLE:

Judicial sentence representation for clustering


PAGES: 123
MAJOR AREA: Exact and Earth Sciences
AREA: Computer Science
ABSTRACT:

The digitization of documents in the Brazilian judicial sector facilitates access to information of public interest. However, to extract metrics of interest from this growing information repository, it is essential to organize documents in a way that makes relevant information easier to retrieve, and machine learning techniques can reduce the human effort needed to organize a large corpus. This work analyzed different machine learning techniques with respect to how well they associate legal terms in agreement with human specialists. To do this, we developed a web scraper to create a corpus of first-instance legal sentences. This corpus is composed of 40,009 documents, totaling 24,139,185 tokens. The FastText, GloVe and Word2Vec techniques were evaluated for their ability to associate terms in accordance with the Legal Thesaurus of the Federal Supreme Court (TSTF). They were compared when trained both in the general domain of the Portuguese language and in the legal domain of the same language. The FastText model trained on the general domain corpus showed the greatest similarity between the associated terms according to the TSTF. Despite this, the legal domain FastText performed comparably to or better than the general domain GloVe and Word2Vec models. We also evaluated the FastText, GloVe, Word2Vec, Doc2Vec and hashing trick techniques in the task of grouping first-instance legal sentences by the subject to which they belong. We compared the models trained in both the general and legal domains using the V-Measure. We conclude that FastText trained on the legal domain corpus, with 300 dimensions, presented results equivalent or superior to those of models trained on the general domain corpus. We also observed that the choice of technique has a greater influence than the choice of hyper-parameters in determining performance. Another factor analyzed in this work was the similarity of documents on different subjects. 
In this analysis, we used the best model produced in the legal domain: the 300-dimensional FastText. We conclude that, despite the uncertainty of the representation created by the model, there seem to be documents on different subjects that are very similar to each other. We also evaluated the performance increase given by the volume of legal documents in the training process, and found that beyond approximately 800,000 tokens, which is equivalent to approximately 1,500 sentences, the marginal performance gain of the 300-dimensional FastText decreases as we add more documents from the legal domain to the training set. Adding more documents from this domain seems to increase computational cost more than it increases model performance. 
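
For readers unfamiliar with the V-Measure mentioned in the abstract, the following is a minimal sketch of how it can be computed from a reference labeling (here, the subject of each sentence) and a cluster assignment. It uses the standard definition as the weighted harmonic mean of homogeneity and completeness; the function name v_measure, the beta parameter default, and the example labels are illustrative assumptions and are not taken from the thesis.

```python
from collections import Counter
from math import log


def v_measure(labels_true, labels_pred, beta=1.0):
    """Return (homogeneity, completeness, v) for two labelings of the same items.

    Illustrative helper, not the thesis code. beta > 1 weights completeness
    more heavily; beta = 1 gives the plain harmonic mean.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("labelings must be non-empty and of equal length")

    class_counts = Counter(labels_true)      # items per reference subject
    cluster_counts = Counter(labels_pred)    # items per cluster
    joint_counts = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        return -sum((c / n) * log(c / n) for c in counts.values())

    h_class = entropy(class_counts)
    h_cluster = entropy(cluster_counts)

    # Conditional entropies H(class | cluster) and H(cluster | class).
    h_class_given_cluster = -sum(
        (n_ck / n) * log(n_ck / cluster_counts[k])
        for (c, k), n_ck in joint_counts.items()
    )
    h_cluster_given_class = -sum(
        (n_ck / n) * log(n_ck / class_counts[c])
        for (c, k), n_ck in joint_counts.items()
    )

    # By convention, a labeling with zero entropy scores 1.
    homogeneity = 1.0 if h_class == 0 else 1.0 - h_class_given_cluster / h_class
    completeness = 1.0 if h_cluster == 0 else 1.0 - h_cluster_given_class / h_cluster

    denominator = beta * homogeneity + completeness
    v = 0.0 if denominator == 0 else (1 + beta) * homogeneity * completeness / denominator
    return homogeneity, completeness, v


# Hypothetical example: four sentences, two subjects, two clusters.
subjects = ["tax", "tax", "labor", "labor"]
clusters = [0, 0, 1, 1]
homogeneity, completeness, v = v_measure(subjects, clusters)
```

Because the score depends only on how items are grouped, it is unaffected by how the clusters happen to be numbered, which makes it convenient for comparing clusterings produced from different document representations.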


COMMITTEE MEMBERS:
Chair - Internal to the Program - 334.489.048-48 - THIAGO FERREIRA COVOES - NOT PROVIDED
Full Member - Examiner Internal to the Program - 1934625 - JESUS PASCUAL MENA CHALCO
Full Member - Examiner External to the Institution - NÁDIA FÉLIX FELIPE DA SILVA - UFG
Alternate Member - Examiner Internal to the Program - 1932365 - FABRICIO OLIVETTI DE FRANCA
Alternate Member - Examiner External to the Institution - ANDRE LUIZ VIZINE PEREIRA - UNIFESP
Notice registered on: 14/03/2022 16:00