Application of Simhash and Hamming Distance in Detecting News Text Similarity
Keywords: text reuse, text similarity detection, Hamming distance, Simhash
Text reuse is defined as the reuse of existing written sources to create a new text. The degree of reuse ranges from duplicate and near-duplicate to topically similar text. Although some kinds of text reuse are acceptable, reused text makes searching less efficient and wastes storage. To overcome this problem, a text similarity detection system is needed. This study focuses on detecting text similarity by applying the Simhash algorithm, which is used to create document fingerprints. These fingerprints serve as document features through which the degree of text similarity can be compared. The similarity of a suspicious text to the source documents is then measured by Hamming distance. Focusing on duplicate and near-duplicate detection, the experiments show that recall for duplicate detection reaches 80%, meaning that the system is capable of retrieving the duplicate sources of a suspicious document.
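The fingerprint-and-compare idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenization, term-frequency weighting, 64-bit fingerprint length, and the use of truncated MD5 as the per-token hash are all assumptions made here for illustration, and the function names `simhash` and `hamming_distance` are hypothetical.

```python
import hashlib
from collections import Counter


def simhash(text: str, bits: int = 64) -> int:
    """Return a Simhash fingerprint of `text` as an integer of `bits` bits."""
    # Assumption: features are lowercase whitespace tokens,
    # weighted by how often they occur in the text.
    weights = Counter(text.lower().split())
    tally = [0] * bits
    for token, weight in weights.items():
        # Assumption: MD5 truncated to `bits` bits as the per-token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        for i in range(bits):
            # A set bit votes +weight for position i, a clear bit votes -weight.
            tally[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(tally):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    source = "the government announced a new policy on fuel prices today"
    suspicious = "the government announced a new policy on fuel prices yesterday"
    print(hamming_distance(simhash(source), simhash(suspicious)))
```

A distance of 0 means the two fingerprints are identical, which is what an exact duplicate produces. Near-duplicates tend to yield small distances, so a detection system would compare the distance against a threshold; the threshold value is a tuning choice not given in the abstract.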
Copyright (c) 2023 Mayesti Anggelina, Lucia Dwi Krisnawati, Danny Sebastian
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Authors who publish articles in JUTEI agree to the following terms:
1. The author grants JUTEI a non-exclusive, royalty-free right and agrees to have the article published online in full (full access). Under this right, JUTEI may store, transfer, manage in various forms, maintain, and publish the article while keeping the author's name as the copyright owner.
2. Each author named in the article has contributed fully to its substance and intellectual content and is accountable to the public. If a copyright infringement notification arises in the future, it is the responsibility of the author, not JUTEI.