Judicial sentence representation for clustering
The digitization of documents in the Brazilian judicial sector facilitates access to information of public interest. However, to extract useful metrics from this growing repository, it is essential to organize documents so that relevant information is easier to retrieve, and machine learning techniques can reduce the human effort of organizing a large corpus. This work analyzed how well different machine learning techniques associate legal terms in agreement with human specialists. To do this, we developed a web scraper to build a corpus of first-instance legal sentences comprising 40,009 documents and 24,139,185 tokens. The FastText, GloVe and Word2Vec techniques were evaluated on their ability to associate terms in accordance with the Legal Thesaurus of the Federal Supreme Court (TSTF). They were compared when trained both on a general-domain Portuguese corpus and on a legal-domain corpus in the same language. The FastText model trained on the general-domain corpus showed the greatest similarity between the terms associated by the TSTF. Even so, the legal-domain FastText performed comparably to or better than the general-domain GloVe and Word2Vec models. We also evaluated the FastText, GloVe, Word2Vec, Doc2Vec and hashing-trick techniques on the task of clustering first-instance legal sentences by the subject to which they belong. We compared models trained in the general and legal domains using the V-Measure. We conclude that FastText trained on the legal-domain corpus, with 300 dimensions, produced results equivalent or superior to those of models trained on the general-domain corpus. We also observed that the choice of technique influences performance more than the choice of hyper-parameters. Another factor analyzed in this work was the similarity of documents on different subjects.
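For readers unfamiliar with the clustering metric mentioned above, the V-Measure is the harmonic mean of homogeneity (each cluster contains only one subject) and completeness (all documents of a subject fall in one cluster), both defined via conditional entropy. A minimal pure-Python sketch, with toy subject labels and cluster assignments in place of the real corpus:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy H(X) of a label assignment.
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    # H(labels | given): entropy of `labels` within each group of `given`,
    # weighted by group size.
    n = len(labels)
    groups = {}
    for lab, grp in zip(labels, given):
        groups.setdefault(grp, []).append(lab)
    return sum(len(ls) / n * entropy(ls) for ls in groups.values())

def v_measure(true_labels, cluster_ids):
    h_c, h_k = entropy(true_labels), entropy(cluster_ids)
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(true_labels, cluster_ids) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(cluster_ids, true_labels) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# Toy example: two subjects, perfectly separated clusters.
subjects = ["tax", "tax", "labor", "labor"]
print(v_measure(subjects, [0, 0, 1, 1]))  # → 1.0 (perfect clustering)
print(v_measure(subjects, [0, 1, 0, 1]))  # → 0.0 (clusters carry no subject information)
```

This matches the definition implemented by `sklearn.metrics.v_measure_score` with the default beta of 1.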
In this analysis, we used the best model produced in the legal domain: the 300-dimensional FastText. We conclude that, despite the uncertainty of the representation created by the model, there appear to be documents on different subjects that are very similar to each other. We also evaluated how performance grows with the volume of legal documents used in training, and found that beyond approximately 800,000 tokens, equivalent to roughly 1,500 sentences, the marginal performance gain of the 300-dimensional FastText decreases as more legal-domain documents are added to the training set. Adding more documents from this domain seems to increase computational cost more than it improves model performance.
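The abstract does not state how document vectors were derived from the word embeddings; a common approach is to average the word vectors of a document's tokens and compare documents by cosine similarity. A minimal sketch of that approach, using made-up 3-dimensional vectors in place of the trained 300-dimensional FastText embeddings:

```python
import math

# Toy word vectors standing in for trained FastText embeddings
# (real vectors would be looked up in the trained model).
word_vecs = {
    "contrato": [0.9, 0.1, 0.0],
    "rescisao": [0.8, 0.2, 0.1],
    "tributo":  [0.1, 0.9, 0.2],
    "imposto":  [0.2, 0.8, 0.3],
}

def doc_vector(tokens):
    # Average the vectors of known tokens into one document vector.
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    dim = len(next(iter(word_vecs.values())))
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

doc_labor = doc_vector(["contrato", "rescisao"])  # hypothetical labor-law document
doc_tax = doc_vector(["tributo", "imposto"])      # hypothetical tax-law document

# Cross-subject similarity is positive but lower than same-document similarity,
# illustrating how documents on different subjects can still score as similar.
print(cosine(doc_labor, doc_tax))
```

With real embeddings, unexpectedly high cross-subject cosine scores are one way to surface the very similar documents on different subjects that the analysis observed.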