The Web today is a wonderfully addictive mashup of culture, commerce, and technologies, both old and new. As an iOS developer, I usually find interacting with the Web trivial. Make a request to a REST API endpoint on a web server, get back data, decode the data. Boom. Done. At least, that’s what I thought, until I ran into odd substrings like ‘&quot;’ and ‘&amp;’ in an XML-formatted response I was parsing. These curious bits of text are known as character entity references (CERs) and, in this case, stand for the quote (") and ampersand (&) characters respectively. In this article I will provide a bit of background for us non-web developers about why CERs exist and give a few practical methods for decoding them in Swift.
What are Character Entity References and why do they exist?
As I mentioned at the start of the article, CERs are basically character codes within a string, sandwiched between an ampersand and a semicolon. Your web browser recognizes these codes and automatically replaces them with the appropriate character for rendering on your screen. So, HTML containing ‘5 &lt; 6’ renders as ‘5 < 6’. The World Wide Web Consortium (W3C) has a spiffy interactive chart you can have a look at if you are interested in seeing more. For those who want to dig deep, you can read the wiki entry or have a close look at the official HTML spec.
Character Entity References exist for a number of reasons:
To allow for inclusion of reserved characters in HTML. Just like any programming language, HTML reserves certain characters for the language itself. Most of us have seen at least a little HTML: each tag begins with < and ends with >. Is it any surprise these characters are reserved?
To allow for characters not included in the encoding format of the document (roughly 90% of web pages are UTF-8 these days, so I think this is mostly for edge cases).
As a convenience to the document writers (web devs) for characters that aren’t included on a standard keyboard. Writing ‘&copy;’ is a lot faster and more efficient than going fishing for the copyright symbol in a special characters library.
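To make the first reason concrete, here is a minimal sketch of the reverse direction: escaping reserved characters before text is placed into markup. The function name `escapingReservedCharacters` is a hypothetical helper made up for this illustration, and it only covers the five XML predefined entities.

```swift
/// Hypothetical helper (not from any library): replaces the five characters
/// that XML reserves with their predefined entity references.
func escapingReservedCharacters(_ text: String) -> String {
    var escaped = ""
    for character in text {
        switch character {
        case "&": escaped += "&amp;"
        case "<": escaped += "&lt;"
        case ">": escaped += "&gt;"
        case "\"": escaped += "&quot;"
        case "'": escaped += "&apos;"
        default: escaped.append(character)
        }
    }
    return escaped
}

print(escapingReservedCharacters("5 < 6 & 7 > 6"))
// prints "5 &lt; 6 &amp; 7 &gt; 6"
```

Decoding, which is what the rest of this article is about, is simply this mapping run backwards.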
Dealing with CERs in Swift
Now that we have the background out of the way, I will show you two methods you can use to ‘find and replace’ CERs in Swift.
Option 1: NSAttributedString
If we dip into Objective-C (not very Swift-like, I know), NSAttributedString already has a lot of functionality built around parsing HTML. Here is an alternate initializer for String that handles CERs:
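import UIKit // on macOS, import AppKit instead

extension String {
    /// Creates a string by decoding the character entity references in an HTML-encoded string.
    init?(htmlEncodedString: String) {
        guard let data = htmlEncodedString.data(using: .utf8) else {
            return nil
        }
        // Ask NSAttributedString to interpret the data as UTF-8 encoded HTML.
        let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
            .documentType: NSAttributedString.DocumentType.html,
            .characterEncoding: String.Encoding.utf8.rawValue
        ]
        guard let attributedString = try? NSAttributedString(data: data, options: options, documentAttributes: nil) else {
            return nil
        }
        // The plain-text content already has its entities decoded.
        self.init(attributedString.string)
    }
}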
Here we initialize an attributed string, specifying the DocumentType as .html. Character entity references are automatically replaced with the appropriate characters on initialization, so all we have to do is return the .string property and we are done! One caveat: the HTML importer is comparatively slow and, per Apple's documentation, should not be called from a background thread. The new initializer can be used like this:
let htmlString = "Easy peasy lemon squeezy. &#127819;"
let fixedString = String(htmlEncodedString: htmlString)
// The initializer is failable, so unwrap before printing.
print(fixedString ?? htmlString)
Easy peasy lemon squeezy. 🍋
Option 2: Regular Expression Matching
For the second technique we will write our own function that uses a dictionary of CER -> Character mappings and regular expressions to perform character substitution manually.
Our dictionary will look like this:
let characterEntities: [String: Character] = [
    // XML predefined entities:
    "&quot;" : "\"",
    "&amp;" : "&",
    "&apos;" : "'",
    "&lt;" : "<",
    "&gt;" : ">",
    // HTML character entity references:
    "&nbsp;" : "\u{00A0}",
    "&iexcl;" : "\u{00A1}", ...]
I’ve left out the full list of CER : character mappings, but you get the idea. As for the rest of the implementation, let’s write a new function as an extension on String so character substitution is available whenever we need it. Here is the full code:
import Foundation

extension String {
    /// Returns a copy of the string with known character entity references
    /// and decimal numeric references (such as "&#127819;") decoded.
    func replacingCharacterEntities() -> String {
        // Converts a decimal numeric reference such as "&#169;" into a Unicode scalar.
        func unicodeScalar(for numericCharacterEntity: String) -> Unicode.Scalar? {
            let digits = numericCharacterEntity.filter { "0123456789".contains($0) }
            guard let scalarValue = Int(digits) else { return nil }
            return Unicode.Scalar(scalarValue)
        }

        var result = ""
        var position = self.startIndex
        let fullRange = NSRange(self.startIndex..<self.endIndex, in: self)
        let pattern = #"(&\S*?;)"#                // anything shaped like "&...;"
        let unicodeScalarPattern = #"&#(\d*?);"#  // decimal numeric references only
        guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else { return self }

        regex.enumerateMatches(in: self, options: [], range: fullRange) { match, _, _ in
            guard let match = match,
                  let range = Range(match.range(at: 0), in: self) else { return }
            // Copy the text between the previous match and this one.
            result.append(contentsOf: self[position..<range.lowerBound])
            let characterEntity = String(self[range])
            if let replacement = characterEntities[characterEntity] {
                result.append(replacement)
            } else if characterEntity.range(of: unicodeScalarPattern, options: .regularExpression) != nil,
                      let scalar = unicodeScalar(for: characterEntity) {
                result.append(String(scalar))
            } else {
                // Unknown entity: keep the original text rather than dropping it.
                result.append(characterEntity)
            }
            position = range.upperBound
        }
        // Copy whatever follows the last match.
        result.append(contentsOf: self[position..<self.endIndex])
        return result
    }
}
So what is this function doing? In essence, we take our original string, look for substrings that match our pattern, iterate over the matches, and build up the result string by using the ranges found in each match to replace any CERs. For those unfamiliar with using NSRegularExpression, there is an excellent article written by Mattt on NSHipster that offers background, examples, and explanations. And while I’m directing you away from this article, I should also recommend regex101.com, an interactive website I use all the time for prototyping regex patterns.
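If the matching step feels opaque, it can help to look at it in isolation. The following is a minimal sketch that only collects the substrings the pattern finds, without replacing anything. The function name `entityCandidates(in:)` is a hypothetical helper made up for this illustration; the pattern is the same one used in the function above.

```swift
import Foundation

/// Hypothetical helper: returns every substring shaped like "&...;",
/// in the order it appears in the text.
func entityCandidates(in text: String) -> [String] {
    let pattern = #"&\S*?;"#
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []) else {
        return []
    }
    // NSRegularExpression works with NSRange, so convert the full String range first.
    let fullRange = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, options: [], range: fullRange).compactMap { match in
        // Convert each NSRange back into a Range<String.Index> before slicing.
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

print(entityCandidates(in: "5 &lt; 6 &amp; 7 &gt; 6"))
// prints ["&lt;", "&amp;", "&gt;"]
```

The round trip between NSRange and Range<String.Index> is the part most likely to trip you up, since NSRange counts UTF-16 code units rather than Swift characters.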
This new function can be called on any string as in:
let htmlString = "Easy peasy lemon squeezy. &#127819;"
print(htmlString.replacingCharacterEntities())
Easy peasy lemon squeezy. 🍋
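One thing to be aware of: the numeric handling in this function only understands decimal references like ‘&#127819;’, while HTML also allows hexadecimal ones such as ‘&#x1F34B;’. Below is a rough sketch of how both forms could be parsed. The function name `unicodeScalar(forNumericReference:)` is my own hypothetical helper, not part of the code above, and it assumes the input is a single complete reference.

```swift
/// Hypothetical helper: parses a decimal ("&#127819;") or hexadecimal
/// ("&#x1F34B;") numeric character reference into a Unicode scalar.
func unicodeScalar(forNumericReference reference: String) -> Unicode.Scalar? {
    guard reference.hasPrefix("&#"), reference.hasSuffix(";") else { return nil }
    // Strip the leading "&#" and the trailing ";".
    let body = reference.dropFirst(2).dropLast()
    let isHex = body.first == "x" || body.first == "X"
    let digits = isHex ? body.dropFirst() : body
    // Reject empty bodies and stray characters such as "+" or "-".
    guard !digits.isEmpty, digits.allSatisfy({ $0.isHexDigit }) else { return nil }
    // UInt32(_:radix:) returns nil for digits that are invalid in the chosen radix.
    guard let value = UInt32(digits, radix: isHex ? 16 : 10) else { return nil }
    // Unicode.Scalar(_:) returns nil for surrogates and out-of-range values.
    return Unicode.Scalar(value)
}

let lemon = unicodeScalar(forNumericReference: "&#x1F34B;")
print(lemon.map { String($0) } ?? "invalid reference")
// prints "🍋"
```

To use something like this in replacingCharacterEntities(), the numeric pattern would also need to accept the ‘x’ prefix and hex digits.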
References
https://nshipster.com/swift-regular-expressions/
https://www.w3.org/TR/html4/cover.html#minitoc
https://gist.github.com/mwaterfall/25b4a6a06dc3309d9555
https://www.swiftbysundell.com/articles/string-literals-in-swift/