Word Count: How I learned one regex was not enough

Links: Repository | Issue | Pull-Request | my-note

I encountered this bug while going through all of my classmates’ applications looking for one that didn’t work correctly in Firefox. It took a while, but I will say it was time well spent.

The Issue

I first noticed something was wrong when I deleted all of the content in the notepad. The word counter read 1. No matter how much I backspaced, the counter never changed to 0.

I then checked the application in Chrome. The word counter did not have that problem. It was at that moment I knew I wanted to fix that bug. How hard could it be? The last Firefox-only bug I fixed took a single line.

Well, it turns out that the function only appeared to work in Chrome, and I needed to reacquaint myself with regular expressions. Do you know what “&nbsp;” means? Because now I do.

Bottom line: wordCount() was counting special characters not visible to the user. (To see the full extent of the issue, check out the link above.)

The Code: wordCount()

function wordCount() {
    var count = document.querySelector("#note").innerHTML.trim().replace(/  +/g, ' ').split(' ').length;
    if (document.querySelector("#note").innerHTML.length == 0) count = 0;
    document.querySelector("#wordNum").innerHTML = "Word Count: " + count;
}

The code trimmed the content of #note, collapsed runs of spaces into a single space, split the result on spaces, and used the number of pieces as the word count.

The if statement set the word counter to 0 if there was nothing in #note. It was needed because calling split() on an empty string returns an array containing one empty string rather than an empty array. Without the check, the length would be 1 even when #note was empty.
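The one-element result is easy to check in a console. This is a minimal sketch: the pieces() helper is a name I made up for illustration, and it only mirrors the trim, replace, and split steps of the function above.

```javascript
// Hypothetical helper mirroring the split step of the original wordCount().
function pieces(html) {
    return html.trim().replace(/  +/g, " ").split(" ");
}

var empty = pieces("");          // [""]: one empty string, not an empty array
var two = pieces("hello world"); // ["hello", "world"]

console.log(empty.length); // 1
console.log(two.length);   // 2
```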

However, the longer you play with it, the more things go wrong. Because it only splits on spaces, it doesn’t count words added after a new line. In Chrome, multiple spaces become non-breaking spaces (&nbsp;), which trim() does not remove, so they were added to the count. Other markup that ended up in the count in Chrome were the <br> and <div> tags: you can’t see them while typing, but they are present in the innerHTML.
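To see these miscounts without a browser, here is a rough sketch that applies the same logic to plain strings. The oldCount() name is hypothetical, and the sample strings are my assumptions about the kind of markup a browser may leave in innerHTML, not captured output.

```javascript
// Hypothetical string-only version of the original counting logic,
// so it can be run outside the browser.
function oldCount(html) {
    if (html.length == 0) return 0;
    return html.trim().replace(/  +/g, " ").split(" ").length;
}

console.log(oldCount("one two"));           // 2, correct
console.log(oldCount("<br>"));              // 1, although nothing is visible
console.log(oldCount("one<div>two</div>")); // 1, the word on the new line is missed
console.log(oldCount("one &nbsp; two"));    // 3, the entity is counted as a word
```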

The Fix

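function wordCount() {
    var count = document.querySelector("#note").innerHTML.trim();
    // Drop non-breaking space entities.
    count = count.replace(/&nbsp;/g, "");
    // Turn every tag (<br>, <div>, ...) into a space.
    count = count.replace(/<[^>]*>/g, " ");
    // Collapse runs of whitespace, including new lines, into one space.
    count = count.replace(/\s+/g, " ");
    count = count.split(" ");
    // Count the empty strings left by leading or trailing spaces.
    var p = 0;
    for (var i = 0; i < count.length; i++) {
        if (count[i] == "")
            ++p;
    }
    count = count.length - p;
    document.querySelector("#wordNum").innerHTML = "Word Count: " + count;
}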

My code gets rid of all non-breaking spaces (&nbsp;), then turns all tags (/<[^>]*>/) and runs of whitespace characters (/\s+/) into single spaces. Depending on the content, split() will return an array with some empty strings, so the loop makes sure no empty strings are added to the count.
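The same steps can be tried on plain strings outside the browser. This is a sketch under assumptions, not the code from the pull request: countWords() is a hypothetical name, it splits directly on /\s+/, and it swaps the counting loop for filter() to drop the empty strings.

```javascript
// Hypothetical string-only variant of the fixed logic.
function countWords(html) {
    return html
        .replace(/&nbsp;/g, "")     // drop non-breaking space entities
        .replace(/<[^>]*>/g, " ")   // tags become spaces
        .split(/\s+/)               // split on any run of whitespace
        .filter(function (word) {   // discard the empty strings
            return word !== "";
        }).length;
}

console.log(countWords(""));                  // 0
console.log(countWords("<br>"));              // 0
console.log(countWords("one<div>two</div>")); // 2
console.log(countWords("one &nbsp; two"));    // 2
```

Because the empty strings are filtered out, no separate check for an empty note is needed.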

Final Thoughts

While the fix took me longer than I expected, it was rewarding to figure out what was wrong. After I opened a pull request to RyanWils, I decided to see if any of my other classmates had the same issue. What I found was that nearly everyone who had implemented the word count function had used the same buggy code for it. I opened issues for them too.

Sometimes open source can be helpful, but when you’re not careful, the same bug can spread across the web. How often do you go back to the place where you found the bit of code you took? How long would it take you to realize something in your app is wrong if you never meant to use your code for more than what an assignment called for? How many people will be affected by the code you write?
