Judicial sentence representation for clustering
The digitization of documents in the Brazilian judicial sector facilitates access to information of public interest. However, to gather useful metrics from this growing repository, documents must be organized in a way that supports the retrieval of relevant information, and machine learning techniques can reduce the human effort needed to organize a large corpus. In this work, we analyze different machine learning techniques with regard to how closely the associations they learn between legal terms match those established by human experts. To this end, we built a corpus of 40,009 documents extracted from the e-Saj website. We then trained Word2Vec, FastText, and GloVe models on these documents and compared them with counterparts trained on general-domain Portuguese text. The Legal Thesaurus of the Portuguese Language was used as the reference for specialist knowledge. Preliminary experiments show that FastText produced the models whose term associations most closely resemble those in the Thesaurus. Models trained on general-domain Portuguese performed better in most term categories, although the difference is small in some of them. These preliminary results suggest that enlarging the legal corpus is a promising way to surpass the performance of the general-domain models, even if the legal corpus remains smaller than the general-domain one.
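The comparison between an embedding model and a thesaurus can be illustrated with a small sketch. The abstract does not state which agreement measure was used, so the score below (the share of a term's k nearest neighbours, by cosine similarity, that the thesaurus also lists as related) is an assumption chosen for illustration. The function names, the toy vectors, and the toy thesaurus entries are likewise hypothetical and do not come from the e-Saj corpus or the Legal Thesaurus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_terms(embeddings, term, k):
    """Return the k terms whose vectors are most similar to `term`."""
    scored = [
        (cosine(embeddings[term], vec), other)
        for other, vec in embeddings.items()
        if other != term
    ]
    # Sort by descending similarity; break ties alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


def thesaurus_agreement(embeddings, thesaurus, k=2):
    """Mean share of thesaurus-related terms found among the k nearest
    neighbours of each term (hypothetical metric, assumed for illustration)."""
    scores = []
    for term, related in thesaurus.items():
        if term not in embeddings:
            continue  # term is out of the model's vocabulary
        related_in_vocab = {r for r in related if r in embeddings}
        if not related_in_vocab:
            continue  # nothing to compare against
        neighbours = nearest_terms(embeddings, term, k)
        hits = sum(1 for n in neighbours if n in related_in_vocab)
        scores.append(hits / min(k, len(related_in_vocab)))
    return sum(scores) / len(scores) if scores else 0.0


# Toy 3-dimensional vectors; a trained model would supply real ones.
toy_embeddings = {
    "habeas_corpus": [0.9, 0.1, 0.0],
    "prisao": [0.8, 0.2, 0.1],
    "liberdade": [0.7, 0.3, 0.0],
    "contrato": [0.0, 0.9, 0.4],
    "locacao": [0.1, 0.8, 0.5],
}

# Toy expert associations standing in for thesaurus entries.
toy_thesaurus = {
    "habeas_corpus": {"prisao", "liberdade"},
    "contrato": {"locacao"},
}

score = thesaurus_agreement(toy_embeddings, toy_thesaurus, k=2)
print(f"agreement with thesaurus: {score:.2f}")
```

Running the same function over several models (for example, one trained on legal text and one on general-domain text) would give one score per model, which is the kind of side-by-side comparison the abstract describes. Per-category scores could be obtained by restricting the thesaurus dictionary to the terms of one category.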