Evaluation of Phonetic Algorithms for the Improvement of String Search in Social Networks.
The recent spread of social networks has radically changed how people use the Internet. These tools let users write and share information simply and immediately. Because conversations on social networks are posted in real time, social media texts differ from traditional texts in several ways. They are unstructured, appear in various formats, and are written by different people in many languages and styles. In addition, typos and chat slang have become increasingly common on social networks such as Facebook and Twitter.
In this fast, dynamic environment, messages often contain typing errors. Incorrectly written messages rarely impair communication between the interlocutors, because errors can be corrected quickly afterwards and misspelled terms usually retain enough phonetic similarity to the intended word for readers to understand them. Although such errors do not interfere decisively with the dialogues, they matter when Twitter data is used for social media analysis: incorrectly spelled terms can reduce the number of records retrieved by classic search algorithms. This shrinks the volume of data available for analysis and, consequently, lowers the accuracy of the studies performed.
A phonetic algorithm is a similarity search algorithm that transforms an input word into a phonetic code roughly indicating how the term is pronounced in a particular language. Words that sound alike therefore receive the same or similar codes, even when they are spelled differently. Thus, phonetic algorithms may play an important role in improving word search in unstructured and noisy data, such as data coming from social networks.
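As an illustration of the idea, the following minimal sketch implements American Soundex, one well-known phonetic algorithm for English. Soundex is used here only as an assumed example; the text above does not state which algorithms are evaluated. The function names `soundex` and `phonetic_matches` and the sample tokens are hypothetical and chosen for this sketch.

```python
# Minimal sketch of American Soundex (an assumed example algorithm,
# not necessarily one of the algorithms evaluated in this work).

# Consonant groups that share a Soundex digit.
CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code of an English word."""
    # Keep only the ASCII letters a-z; digits and punctuation are ignored.
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        digit = CODES.get(ch, "")  # vowels and y map to "" and reset prev
        if digit and digit != prev:
            code += digit
            if len(code) == 4:
                break
        prev = digit
    return code.ljust(4, "0")  # pad short codes with zeros


def phonetic_matches(query, tokens):
    """Return the tokens whose Soundex code equals that of the query."""
    target = soundex(query)
    return [t for t in tokens if soundex(t) == target]


# Hypothetical tokens as they might appear in noisy social media text.
tokens = ["tommorow", "tomorow", "today", "tmrw"]
print(soundex("tomorrow"))                   # T560
print(phonetic_matches("tomorrow", tokens))  # ['tommorow', 'tomorow', 'tmrw']
```

An exact string comparison against "tomorrow" would retrieve none of these tokens, whereas matching on the phonetic code recovers the two misspellings and the abbreviation. The trade-off is precision: unrelated words can share a code, so such a filter is usually combined with other checks.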
This work studies the impact of phonetic algorithms on search in very large unstructured text databases. We also suggest evaluation methods for different families of phonetic algorithms and for other categories of similarity search methods.