Dec 4, 2017

Detecting Non-breaking Space in R

Last night, while cleaning data in R, I ran into behavior from the unique() function that was so weird I briefly suspected something was wrong with the newly updated R 3.4.3. The following is a reproducible example of the problem.

I read a vector x in from a CSV file and printed it on the R console.

## [1] "non-breaking space" "non-breaking space"
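The original CSV file is not included here, so the following is only a minimal sketch of how such a vector could arise. The temporary file, the column name x, and the two values are assumptions chosen to mimic the output above; the second value contains U+00A0 instead of an ordinary space.

```r
# Hypothetical stand-in for the original CSV: one column named "x" whose
# second value uses a non-breaking space (U+00A0) between the two words.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  enc2utf8(c("x", "non-breaking space", "non-breaking\u00A0space")),
  csv_path,
  useBytes = TRUE  # write the UTF-8 bytes unchanged
)

# Read the column back as a character vector and print it.
x <- read.csv(csv_path, stringsAsFactors = FALSE, encoding = "UTF-8")$x
x
```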

Apparently, x contained two “identical” elements, so it should have had only one unique value. However, unique(x) returned two unique values that were visually the same.

It was so weird. From my experience of cleaning data in R, I suspected that the problem lay in the space. I tried Space and Tab (displayed as \t on the R console, so it is easily distinguished from Space), the two whitespace characters I usually encounter during data cleaning. Unfortunately, that did not resolve the problem. I worked on it until 3 AM and tried every possibility I could think of, including uninstalling the newly updated R 3.4.3 and running the code above with an older version of R.

After hours of trial and error, I got the feeling that the problem was related to the encoding of the space. Finally, I copied the raw data into ASCII Value Tool to show the character code of the space. The space in x[1] has the decimal value 32, which is the ordinary space. The space in x[2], by contrast, has the decimal value 160, which lies outside the 7-bit ASCII range and corresponds to the non-breaking space. In HTML, the non-breaking space is common (though this was the first time I had encountered it while cleaning data in R) and is written as &nbsp; or &#160;. In Unicode, it is U+00A0. In UTF-8, it is encoded as the bytes C2 A0.
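The values 32 and 160 can also be checked without leaving R. The sketch below is not the code I ran that night; it rebuilds a stand-in for x by hand (an assumption, since the CSV is not shown) and uses base R's utf8ToInt() and charToRaw() to compare the two strings character by character.

```r
# Stand-in for x: the second element contains U+00A0 instead of a space.
x <- c("non-breaking space", "non-breaking\u00A0space")

length(unique(x))   # 2: the strings look alike but are not equal

# Unicode code points of every character in each string.
cp1 <- utf8ToInt(x[1])
cp2 <- utf8ToInt(x[2])

# Position(s) at which the two strings differ.
diff_pos <- which(cp1 != cp2)
diff_pos            # 13
cp1[diff_pos]       # 32  (ordinary space)
cp2[diff_pos]       # 160 (non-breaking space, U+00A0)

# The UTF-8 bytes of the differing character: c2 a0.
charToRaw(enc2utf8(x[2]))[13:14]
```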

I used the following R code to confirm that x[2] contained a non-breaking space.

## 2: non-breaking<c2><a0>space
## [1] "ASCII" "UTF-8"

showNonASCII() (from the tools package) picked out the non-ASCII character in x[2] and printed it as <c2><a0>. stri_enc_mark() (from the stringi package) showed that the encodings of x[1] and x[2] were ASCII and UTF-8, respectively.
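For readers who want to reproduce this check, here is a sketch of what the calls might look like. The vector x is again a hand-built stand-in rather than the original data, and the stringi call is guarded because that package may not be installed; base R's Encoding() is included as an alternative that needs no extra package.

```r
# Stand-in for x: the second element contains U+00A0 instead of a space.
x <- c("non-breaking space", "non-breaking\u00A0space")

# Report elements containing non-ASCII characters, shown as <xx> byte escapes.
tools::showNonASCII(x)

# Encoding marks as reported by stringi, if the package is available.
if (requireNamespace("stringi", quietly = TRUE)) {
  print(stringi::stri_enc_mark(x))
}

# Base R alternative: ASCII-only strings are reported as "unknown".
enc <- Encoding(x)
enc
```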

Once I knew that the space in x[2] was a non-breaking space, I fixed the problem by replacing it with an ordinary space. Now unique(y) returns only one unique value.

y <- gsub("\u00A0", " ", x, fixed = TRUE)
unique(y)
## [1] "non-breaking space"

In most cases, a non-breaking space is displayed exactly like an ordinary space, so the two cannot be told apart visually. Therefore, I installed the Unicode Character Highlighter plugin for Sublime Text, the editor I use most. Now Sublime Text highlights non-breaking spaces and I can detect them visually.
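A similar check can be done inside R before any cleaning. The following is a minimal sketch, assuming the same hand-built stand-in for x as in the example; it flags the elements that contain U+00A0 and verifies that none remain after the substitution.

```r
# Stand-in for x: the second element contains U+00A0 instead of a space.
x <- c("non-breaking space", "non-breaking\u00A0space")

# Flag elements that contain a non-breaking space.
has_nbsp <- grepl("\u00A0", x, fixed = TRUE)
has_nbsp          # FALSE TRUE
which(has_nbsp)   # 2

# Replace it with an ordinary space and confirm nothing is left to fix.
y <- gsub("\u00A0", " ", x, fixed = TRUE)
any(grepl("\u00A0", y, fixed = TRUE))   # FALSE
length(unique(y))                       # 1
```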

Finally, I end this post with a non-breaking space geek joke created by Ridzal Zainal. Can you get the point?