Our issue tonight is how to cluster text strings that are near matches of one another.
We have at our disposal a dataset of companies whose names contain the term edf. Our goal is to find functions and ways to cluster together the records of the same company that were entered under different names. It is a wide subject, so to narrow it down we look at only two functions, agrep and adist, both of which rely on the Levenshtein distance.
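As a quick illustration of what the Levenshtein distance measures, here is a minimal sketch using base R only. The example strings are ours and are not taken from the dataset.

```r
# The Levenshtein distance counts the minimum number of single-character
# insertions, deletions and substitutions needed to turn one string into another.
d <- adist("kitten", "sitting")
print(d)  # a 1 x 1 matrix holding the distance

# By default the comparison is case-sensitive; ignore.case = TRUE relaxes that.
d_cs <- adist("EDF", "edf")
d_ci <- adist("EDF", "edf", ignore.case = TRUE)
print(c(case_sensitive = d_cs[1, 1], case_insensitive = d_ci[1, 1]))
```

Ignoring case is usually a sensible choice for company names typed in by hand, but that is a judgment call rather than a requirement.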
Load the data:
We attach the libraries.
The dataset is a set of company names and descriptions, all related to EDF, taken from that article.
We use the DT package to get an overview:
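The overview step might look like the sketch below. The data frame `companies`, its column names and its six rows are hypothetical stand-ins for the real dataset, which is not reproduced here. The call to DT is guarded so that the snippet still runs when the package is not installed.

```r
# Hypothetical stand-in for the real dataset: names and descriptions are invented.
companies <- data.frame(
  name = c("EDF", "EDF SA", "EDF S.A.", "EDF Energy",
           "EDF Energies", "EDF Trading"),
  description = c("utility", "utility", "utility", "energy supplier",
                  "energy supplier", "energy trading"),
  stringsAsFactors = FALSE
)

# Base R overview of the structure.
str(companies)

# Interactive table, only if the DT package is available.
if (requireNamespace("DT", quietly = TRUE)) {
  overview <- DT::datatable(companies)
  # In an interactive session or an R Markdown document, print(overview)
  # renders the table as an HTML widget.
}
```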
First try: agrep
agrep performs approximate matching.
We build a matrix of matches and count, for each company name, the number of names in which an approximate match is found.
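One way to build such a match matrix is sketched below, using agrepl, the logical variant of agrep. The vector `company_names` is a hypothetical sample, and the value of `max.distance` is an assumption, not necessarily the setting used for the results shown here.

```r
# Hypothetical sample of company names, standing in for the real data.
company_names <- c("EDF", "EDF SA", "EDF S.A.", "EDF Energy",
                   "EDF Energies", "EDF Trading")

# Use each name in turn as the pattern and test it against every name.
# Column j of the result answers: in which names is pattern j found approximately?
match_matrix <- sapply(company_names, function(pattern) {
  agrepl(pattern, company_names, max.distance = 0.1,
         ignore.case = TRUE, fixed = TRUE)
})
rownames(match_matrix) <- company_names

# Number of fuzzy matches per pattern (each name always matches itself).
match_counts <- colSums(match_matrix)
print(match_counts)
```

Note that agrep looks for the pattern inside each target string, so a short name such as "EDF" matches every longer name that contains it. The matrix is therefore not symmetric in general.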
The result is displayed in a datatable format as well.
This first result is encouraging: for each name, it gives the number of fuzzy matches.
Second try: adist
adist computes a matrix of distances between names.
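A minimal sketch of that distance matrix follows. The vector `company_names` is again a hypothetical sample, and ignoring case is our assumption.

```r
# Hypothetical sample of company names, standing in for the real data.
company_names <- c("EDF", "EDF SA", "EDF S.A.", "EDF Energy",
                   "EDF Energies", "EDF Trading")

# Pairwise generalized Levenshtein distances between all names.
dist_matrix <- adist(company_names, ignore.case = TRUE)

# adist only labels the result when its input is a named vector,
# so we attach the names ourselves for readability.
rownames(dist_matrix) <- company_names
colnames(dist_matrix) <- company_names
print(dist_matrix)
```

Unlike the agrep match matrix, this one is symmetric with zeros on the diagonal, which is what makes it usable as input for clustering.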
Once we have a distance matrix, hierarchical clustering becomes possible.
The dendrogram is pleasant to plot and shows the similarities at a glance, since up to 75 names fit in the same graph.
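The clustering step can be sketched as follows. The sample names are hypothetical, and both the "average" linkage and the cut into two groups are assumptions chosen for illustration. They are not necessarily the settings behind the dendrogram described above.

```r
# Hypothetical sample of company names, standing in for the real data.
company_names <- c("EDF", "EDF SA", "EDF S.A.", "EDF Energy",
                   "EDF Energies", "EDF Trading")

# Distance matrix from adist, labelled so the dendrogram leaves show the names.
dist_matrix <- adist(company_names, ignore.case = TRUE)
rownames(dist_matrix) <- company_names

# hclust expects a "dist" object; the linkage method is an assumption.
hc <- hclust(as.dist(dist_matrix), method = "average")

# Draw the dendrogram.
plot(hc, main = "Company names clustered by edit distance",
     xlab = "", sub = "")

# Cut the tree into a fixed number of groups (k = 2 is arbitrary here).
groups <- cutree(hc, k = 2)
print(groups)
```

In practice the number of groups, or the height at which to cut the tree, has to be tuned by looking at the dendrogram and checking which names end up together.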