Calculate the match percentage between two strings

Usage

percent_match(
  string1,
  string2,
  case_sensitive = FALSE,
  ignore_whitespace = TRUE,
  frag_size = 2
)

string1 %match% string2

sound_match(string1, string2)

Arguments

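string1: first string to compare
string2: second string to compare
case_sensitive: logical; whether the comparison is case sensitive (default FALSE)
ignore_whitespace: logical; whether to remove whitespace before comparing (default TRUE)
frag_size: size, in characters, of the fragments used for fragment matching (default 2)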

Value

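percent_match: a list of numeric match percentages, with the elements exact_match_percent, substring_match_percent, levenshtein_match_percent, f_m_p (fragment match percent) and overall_match_percent, as shown in the examples

sound_match: TRUE if the two words sound alike, FALSE otherwise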

Details

Case Sensitivity:

With case_sensitive set to FALSE (the default), the function ignores letter case. Set it to TRUE to treat upper-case and lower-case letters as different.

Whitespace Handling:

With ignore_whitespace set to TRUE (the default), the function removes all whitespace before comparison. This is useful for matching strings that may have inconsistent spacing.
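
The two preprocessing options above can be sketched in base R as follows. This is only an illustration of the idea: normalise_string is a hypothetical helper name, not a function exported by the package, and the package may normalise its input differently.

```r
# Hypothetical helper (not part of the package): prepare a string for comparison.
normalise_string <- function(x, case_sensitive = FALSE, ignore_whitespace = TRUE) {
  # Fold to lower case unless the comparison should be case sensitive
  if (!case_sensitive) x <- tolower(x)
  # Drop spaces, tabs and newlines when whitespace should be ignored
  if (ignore_whitespace) x <- gsub("[[:space:]]+", "", x)
  x
}

normalise_string("Hello World")
#> [1] "helloworld"
normalise_string("Hello World", case_sensitive = TRUE, ignore_whitespace = FALSE)
#> [1] "Hello World"
```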

Exact Character-by-Character Matching:

The function computes the percentage of matching characters in the same positions.
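
A minimal sketch of position-by-position matching is given below. The name exact_match_sketch is hypothetical, and the sketch assumes that strings of different lengths score 0; that is consistent with the examples on this page but is not confirmed by the package source.

```r
# Hypothetical sketch (not the package's code): share of positions holding the same character.
exact_match_sketch <- function(a, b) {
  # Assumption: strings of unequal length (or empty strings) score 0
  if (nchar(a) != nchar(b) || nchar(a) == 0) return(0)
  chars_a <- strsplit(a, "", fixed = TRUE)[[1]]
  chars_b <- strsplit(b, "", fixed = TRUE)[[1]]
  # Proportion of equal positions, expressed as a percentage
  round(100 * mean(chars_a == chars_b), 2)
}

exact_match_sketch("2018-10-12", "1898-10-12")
#> [1] 70
```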

Substring Matching:

The function checks if one string is a substring of the other, awarding a full match if true.
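
The substring check can be illustrated with base R's grepl using fixed = TRUE, so that the strings are compared literally rather than as regular expressions. The function name substring_match_sketch is hypothetical and the package may implement the check differently.

```r
# Hypothetical sketch (not the package's code): 100 if either string contains the other, else 0.
substring_match_sketch <- function(a, b) {
  # fixed = TRUE treats the pattern as literal text, not as a regular expression
  contained <- grepl(a, b, fixed = TRUE) || grepl(b, a, fixed = TRUE)
  if (contained) 100 else 0
}

substring_match_sketch("world", "helloworld")
#> [1] 100
substring_match_sketch("2018-10-12", "1898-10-12")
#> [1] 0
```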

Levenshtein Distance:

The function uses Levenshtein distance to calculate the similarity and integrates this into the overall match percentage.
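
One common way to turn an edit distance into a similarity is to divide it by the length of the longer string. The sketch below does this with utils::adist from base R. The name levenshtein_match_sketch and the choice of normalisation are assumptions, although the resulting values agree with the examples on this page.

```r
# Hypothetical sketch (not the package's code): similarity = 1 - distance / longest length.
levenshtein_match_sketch <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  # Assumption: two empty strings count as identical
  if (longest == 0) return(100)
  # utils::adist returns a matrix of edit distances; take the single entry
  distance <- utils::adist(a, b)[1, 1]
  round(100 * (1 - distance / longest), 2)
}

levenshtein_match_sketch("2018-10-12", "1898-10-12")
#> [1] 70
levenshtein_match_sketch("helloworld", "heloworld")
#> [1] 90
```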

Fragment Matching:

- The frag_size argument sets the size of the fragments (substrings) taken from both strings for comparison (default is 2).
- The function creates unique fragments from each string and compares them to find common fragments.
- The percentage match is calculated as the ratio of common fragments to the total number of unique fragments across both strings.
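
The steps above can be sketched as follows. The helper names fragments and fragment_match_sketch are hypothetical, and taking the total as the union of both fragment sets is an assumption, although it agrees with the f_m_p values shown in the examples.

```r
# Hypothetical helper (not part of the package): unique overlapping fragments of a string.
fragments <- function(x, frag_size = 2) {
  n <- nchar(x)
  if (n < frag_size) return(character(0))
  starts <- seq_len(n - frag_size + 1)
  unique(substring(x, starts, starts + frag_size - 1))
}

# Hypothetical sketch: common fragments divided by all unique fragments of both strings.
fragment_match_sketch <- function(a, b, frag_size = 2) {
  frags_a <- fragments(a, frag_size)
  frags_b <- fragments(b, frag_size)
  total <- length(union(frags_a, frags_b))
  if (total == 0) return(0)
  round(100 * length(intersect(frags_a, frags_b)) / total, 2)
}

# Inputs are already lower case with whitespace removed
fragment_match_sketch("october12,1898", "1898-10-12")
#> [1] 23.53
fragment_match_sketch("october12,1898", "1898-10-12", frag_size = 4)
#> [1] 5.88
```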

Combining Metrics:

The overall match percentage is computed as the average of exact match, substring match, Levenshtein match, and fragment match percentages.
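
As a small illustration of the averaging step, the sketch below takes the four component percentages and returns their mean. The name combine_sketch is hypothetical, and rounding to two decimal places is an assumption based on the example output.

```r
# Hypothetical sketch (not the package's code): unweighted mean of the four metrics.
combine_sketch <- function(exact, substring, levenshtein, fragment) {
  round(mean(c(exact, substring, levenshtein, fragment)), 2)
}

# Component values taken from the "2018-10-12" vs "1898-10-12" example
combine_sketch(exact = 70, substring = 0, levenshtein = 70, fragment = 60)
#> [1] 50
```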

Examples

# Example 1: simple match
string1 <- "Hello World"
string2 <- "helo world"

# percent_match() returns a list, so extract the overall percentage for the message
match_percent <- percent_match(string1, string2)
message("Percentage of matching: ", match_percent$overall_match_percent)
#> Percentage of matching: 44.72


# Example 2: which date is closest
string0 <- "october 12,1898"
string1 <- "2018-10-12"
string2 <- "1898-10-12"
percent_match(string0, string1)
#> $exact_match_percent
#> [1] 0
#> 
#> $substring_match_percent
#> [1] 0
#> 
#> $levenshtein_match_percent
#>       [,1]
#> [1,] 14.29
#> 
#> $f_m_p
#> [1] 10.53
#> 
#> $overall_match_percent
#>      [,1]
#> [1,]  6.2
#> 
percent_match(string0, string2)
#> $exact_match_percent
#> [1] 0
#> 
#> $substring_match_percent
#> [1] 0
#> 
#> $levenshtein_match_percent
#>       [,1]
#> [1,] 14.29
#> 
#> $f_m_p
#> [1] 23.53
#> 
#> $overall_match_percent
#>      [,1]
#> [1,] 9.45
#> 
percent_match(string0, string2, frag_size = 4)
#> $exact_match_percent
#> [1] 0
#> 
#> $substring_match_percent
#> [1] 0
#> 
#> $levenshtein_match_percent
#>       [,1]
#> [1,] 14.29
#> 
#> $f_m_p
#> [1] 5.88
#> 
#> $overall_match_percent
#>      [,1]
#> [1,] 5.04
#> 
percent_match(string1, string2)
#> $exact_match_percent
#> [1] 70
#> 
#> $substring_match_percent
#> [1] 0
#> 
#> $levenshtein_match_percent
#>      [,1]
#> [1,]   70
#> 
#> $f_m_p
#> [1] 60
#> 
#> $overall_match_percent
#>      [,1]
#> [1,]   50
#> 

sound_match("Robert","rupert")
#> [1] TRUE
sound_match("rupert","Rubin")
#> [1] FALSE
sound_match("book","oops")
#> [1] FALSE