Automated Brand Coding (Auto-Coding)
Summary | This article describes how to set up auto-coding for an open question in which respondents may enter several brands. The keyword 'DLDistance' is used to flag codings where the entered word differs substantially from the brand text (or from an acceptable variation of it). |
Applies to | AskiaDesign |
Written for | Survey programmers, Scripters, Coders |
Keywords | Auto-coding, DLdistance, Text matching, String, InStr, Damerau-Levenshtein Distance, Coding, Split, Separator, Code-frame, Brands, autocoding |
Download the example QEX files in the links below:
- AutoCoding-Simple.qex (use 5.3.5.5 and above)
- AutoCoding-DLDistance-Flag.qex (use 5.4.5.0 and above)
Auto-Coding
We have an open question as follows:
Which insurers come to mind?
(Please type in your answers in the boxes below)
We have a code frame into which the responses will be coded afterwards:
Manual coding can take a lot of time, but we can try to automate it using AskiaScript. The first step is to set up a replica of the code-frame above in which each code's caption stores all the acceptable variations of the brand text. The variations are separated by a character which will not occur in any of them, in this case the pipe “|”.
The idea is to split each code's caption on the pipe to create a text array, then check whether any of the split texts appears in the open response. For example:
- If my open response was simply “NorU”, it would match one of the variations for code 1 and be coded as {1}.
- If my open response was “Barkleys… hmm Zorich also”, it would match variations for codes 6 and 7 and be coded as {6;7}.
You can test this in AutoCoding-Simple.qex.
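' Codes matched by the open response
Dim arrResult = {}
Dim i
Dim j
Dim arrSpellings
' Loop over every code in the spellings code-frame
For i = 1 to q13_insurers_spellings.Responses.Count
    ' Split the caption into its accepted variations
    arrSpellings = q13_insurers_spellings.Responses[i].Caption.Split("|")
    For j = 1 to arrSpellings.Count
        ' A position greater than 0 means the variation occurs in the open response
        If Instr(arrSpellings[j],q13_insurers_unp.Value) > 0 Then
            arrResult = arrResult + i
        Endif
    Next j
Next i
Return arrResult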
This works, but simple text matching like this has a flaw. If I type “Aardvark” as my open response, it contains the text for the brand AA, so I will be coded as code 11. So how can we get around this?
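To see the false positive concretely, here is a minimal sketch of the same matching logic in Python rather than AskiaScript, so it can be run outside Askia. The code numbers come from the examples above, but the `CODE_FRAME` spellings and the `auto_code` helper are hypothetical. The sketch also lower-cases both sides before comparing, which is an assumption about how the match should behave.

```python
# Hypothetical code frame: code number -> pipe-separated accepted spellings.
# The spellings are made up for illustration; only the code numbers follow the article.
CODE_FRAME = {
    1: "NorU|Nor U",
    6: "Barkleys|Barclays",
    7: "Zorich|Zurich",
    11: "AA",
}


def auto_code(response, code_frame):
    """Return every code for which at least one spelling occurs in the response."""
    text = response.lower()  # assumption: matching ignores case
    coded = []
    for code, spellings in code_frame.items():
        if any(spelling.lower() in text for spelling in spellings.split("|")):
            coded.append(code)
    return coded


print(auto_code("NorU", CODE_FRAME))                       # [1]
print(auto_code("Barkleys… hmm Zorich also", CODE_FRAME))  # [6, 7]
print(auto_code("Aardvark", CODE_FRAME))                   # [11], the false positive
```

The last call shows the problem: a plain substring test cannot tell that “Aardvark” is a different word from “AA”.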
There is a sliding scale of what might be acceptable variations of the listed brands. If we have a measure of how different one word is from another, we can use the result of the comparison (a number) and set a threshold for what we accept and what we need to review or ignore.
In our coding example, we had Aardvark matching to AA. However, Aardvark has 6 differences from AA (6 additional letters).
If we type “AAs”, then again we match to AA, but this time there is only one difference (1 additional letter). “AAs” should be coded against code 11 (AA); “Aardvark” should not.
Auto coding using DL Distance
This is where the DLDistance keyword comes in handy.
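In information theory and computer science, the Damerau–Levenshtein distance is a string metric for measuring the edit distance between two sequences: the minimum number of single-character insertions, deletions and substitutions, plus transpositions of two adjacent characters, needed to turn one sequence into the other.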
The keyword is case sensitive. We don’t need to consider case as a differentiator in our coding example, so we can apply either .ToLowerCase() or .ToUpperCase() to both words before we compare them using DLDistance.
dim variation = "AA"
dim response1 = "Aardvark"
dim response2 = "AAs"
variation.ToLowerCase().DLDistance(response1.ToLowerCase()) ' = 6
variation.ToLowerCase().DLDistance(response2.ToLowerCase()) ' = 1
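The two distances above can be reproduced outside Askia with a short Python sketch. It implements the restricted form of the Damerau–Levenshtein distance (often called optimal string alignment). Whether Askia's DLDistance keyword uses the restricted or the unrestricted form is not stated here, so treat `dl_distance` as an illustrative stand-in and not as the keyword's actual implementation.

```python
def dl_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between the first i chars of a and the first j chars of b.
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


variation = "AA"
print(dl_distance(variation.lower(), "Aardvark".lower()))  # 6
print(dl_distance(variation.lower(), "AAs".lower()))       # 1
print(dl_distance(variation, "aa"))                        # 2, case matters without lower()
```

The last line illustrates why both words are lower-cased first: without it, each differently cased letter counts as a substitution.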
Now that we have a measure of the difference between strings, we can use the result to flag answers with bigger differences.
Test the second example attached, AutoCoding-DLDistance-Flag.qex (use 5.4.5.0 Design or higher):
Here we start with a numeric question (q13_insurers_dld_threshold) where you define your threshold number. If a coding match has a DL distance greater than or equal to this number, it will be flagged in an open variable later on in the example (q13_insurers_coding_flagged). There are some intermediate variables:
- q13_insurers_coded: tells you all brands coded before any cleaning is done.
- q13_insurers_in_word_numbers: tells you which word of your open response has a match with one of the variations.
- q13_insurers_dl_distances: tells you the DL distances of the words from your open response that matched variations.
The last variable is q13_insurers_coded_cleaned. This is the final coding based on your open response, after any cleaning resulting from the DL distance checks. In this example I have chosen to remove any coding whose DL distance is ≥ your threshold.
A couple of final things to note: if I write “AAs, Aardvark” in my open response (with a threshold of 3 set), ‘Aardvark’ will not be flagged. This is because we already have a match for code 11 (AA) in one of my other words, which has a DL distance less than the threshold. There’s no point flagging this as we already have a good match against the same code.
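The flagging and cleaning rules described above can be sketched in Python as follows. The `clean_coding` function and its input format, a list of (code, word number, DL distance) tuples loosely mirroring the intermediate variables, are hypothetical. The routing in the example file may be organised differently.

```python
def clean_coding(matches, threshold):
    """Split matched codes into cleaned (accepted) and flagged lists.

    matches: list of (code, word_number, dl_distance) tuples, one per matched word.
    A code is judged on its closest match only, so a poor match is not flagged
    when another word already matches the same code well.
    """
    best = {}  # code -> smallest DL distance seen for that code
    for code, _word_number, distance in matches:
        if code not in best or distance < best[code]:
            best[code] = distance
    cleaned = sorted(code for code, dist in best.items() if dist < threshold)
    flagged = sorted(code for code, dist in best.items() if dist >= threshold)
    return cleaned, flagged


# "AAs, Aardvark" with a threshold of 3: word 1 is 2 away from AA, word 2 is 6 away.
print(clean_coding([(11, 1, 2), (11, 2, 6)], 3))  # ([11], [])
# "Aardvark" on its own: the only match for code 11 is too far away.
print(clean_coding([(11, 1, 6)], 3))              # ([], [11])
```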
Also, the matching is done on blocks of words: strings separated by a space are counted as separate words. So my first word in the example above is “AAs,” (including the comma), which has a DL distance of 2 when compared with AA. You can make tweaks to the routing conditions in the example file to ignore these sorts of punctuation marks if you wish.
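One possible way to ignore such punctuation, sketched in Python and not taken from the example file, is to trim punctuation from the edges of each word before measuring distances. `split_words` and `EDGE_PUNCTUATION` are hypothetical names.

```python
import string

# Characters trimmed from the start and end of each word. The ellipsis is added
# by hand because string.punctuation only covers ASCII punctuation.
EDGE_PUNCTUATION = string.punctuation + "…"


def split_words(response):
    """Split on whitespace and trim punctuation from the edges of each word."""
    trimmed = (word.strip(EDGE_PUNCTUATION) for word in response.split())
    return [word for word in trimmed if word]  # drop words that were only punctuation


print(split_words("AAs, Aardvark"))              # ['AAs', 'Aardvark']
print(split_words("Barkleys… hmm Zorich also"))  # ['Barkleys', 'hmm', 'Zorich', 'also']
```

With the comma removed, “AAs” is a distance of 1 from AA instead of 2. Punctuation inside a word is left alone, so brand names that contain it are unaffected.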