Word clouds have recently become common in many areas where text data is analyzed. For example, a company may want to study the reviews customers have written about its products or services and determine which words appear most (and least) often. The purpose of a word cloud is to display each word with a size (and sometimes a color) proportional to the number of times it appears in a document.
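To make the idea of "size proportional to count" concrete, here is a minimal sketch in Python. The review snippets are made up for illustration, and the helper names `word_counts` and `font_size` are hypothetical; real word cloud tools may tokenize, filter, and scale words differently.

```python
import re
from collections import Counter


def word_counts(text):
    """Count how often each lowercase word appears in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def font_size(count, max_count, max_size=60):
    """Scale the font size linearly with the count (an assumed scaling rule)."""
    return max_size * count / max_count


# Hypothetical review snippets, invented for this example.
reviews = [
    "The battery lasts all day and the battery charges fast",
    "Great screen but the battery drains quickly",
]
counts = word_counts(" ".join(reviews))
largest = max(counts.values())

# Print each word with its count and the font size it would receive.
for word, n in counts.most_common():
    print(word, n, font_size(n, largest))
```

Under this assumed linear rule, a word that appears three times as often is drawn three times as tall, which is the relationship a reader would have to judge by eye.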
The challenge with word clouds is that they are useful for qualitative analysis but not for quantitative analysis. As an example, Figure 1 shows a word cloud of common words that appear in a few reviews of a smartphone. If you spend a few minutes analyzing this image, you will notice that the word “battery” appears bigger (and redder) than the other words, indicating that in this dataset users talk a lot about the battery of the product. Compare this with the word “new”, which appears smaller (and less red) than the rest of the words. How much bigger would you say the word “battery” is compared to the word “new”?
Let's now analyze Figure 2, where the word “new” is much bigger than the word “battery”. How many more times would you guess that the word “new” occurs relative to “battery”?
Figure 3 shows, on its left and right sides, the relative sizes of both words as they appear in Figures 1 and 2, respectively. How many times bigger is the larger word than its smaller counterpart?
One question you may ask is whether the words should be compared by their width, height or area. For example, on the left side of Figure 4 we observe that the word “battery” is approximately 5 times as wide, 3 times as tall and 5 × 3 = 15 times the area of the word “new”. Similarly, on the right side of Figure 4 we observe that the word “new” is approximately 2 times as wide, 3 times as tall and 2 × 3 = 6 times the area of the word “battery”. Another way of analyzing this is by counting the number of characters that make up the bigger word. In this case, the bigger word on the left side would be over 16 characters wide, while the one on the right side would be about 14 characters wide.
In reality, the larger word occurs 17 times in the dataset, while the smaller word occurs only once. This can be seen in Figure 5, which displays bar charts of the counts of the words that appear in the document. The only difference between the two is that the bar chart on the left side has the word “battery” at the top and the word “new” at the bottom, while on the right side these words are flipped. All the other words in the document are the same in both cases.
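The gap between the visual estimates and the real counts can be put into numbers. The sketch below uses only the approximate ratios read from Figure 4 and the true 17-to-1 ratio; the helper name `visual_ratios` is hypothetical.

```python
def visual_ratios(width_ratio, height_ratio):
    """Return the width, height and area ratios of the bigger word to the smaller one."""
    return {
        "width": width_ratio,
        "height": height_ratio,
        "area": width_ratio * height_ratio,
    }


# True frequency ratio: the bigger word occurs 17 times, the smaller one once.
true_ratio = 17 / 1

# Approximate ratios estimated by eye from Figure 4.
estimates = {
    "left (battery vs. new)": visual_ratios(5, 3),
    "right (new vs. battery)": visual_ratios(2, 3),
}

for side, ratios in estimates.items():
    for measure, value in ratios.items():
        # A value of 1.0 would mean the visual estimate matches the true ratio.
        print(f"{side:24s} {measure:6s} estimate={value:2d} "
              f"estimate/true={value / true_ratio:.2f}")
```

None of the three measures recovers the 17-to-1 ratio on both sides: the area estimate on the left (15) comes closest, while the same measure on the right (6) is far off even though the underlying counts are identical.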
In summary, word clouds are a visually appealing way of displaying text data, and they support a qualitative analysis of word frequency (for example, we can determine which words occur more often). However, they are limited in providing quantitative information, such as how many more times certain words appear compared to others. A bar chart representation may be a quantitative alternative.
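As a rough illustration of why a bar chart is easier to read quantitatively, the following sketch prints a plain-text bar chart using only the two counts given above (17 and 1). The function name `text_bar_chart` is hypothetical, and a plotting library would normally be used instead of text output.

```python
def text_bar_chart(counts):
    """Render word counts as a text bar chart, most frequent word first."""
    label_width = max(len(word) for word in counts)
    lines = []
    for word, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        # Each '#' represents one occurrence; the exact count follows the bar.
        lines.append(f"{word.ljust(label_width)} | {'#' * n} {n}")
    return "\n".join(lines)


# Only the two counts reported in the text; the other words are omitted.
counts = {"battery": 17, "new": 1}
print(text_bar_chart(counts))
```

Because bar length maps directly to the count and the number is printed next to each bar, the 17-to-1 relationship can be read off without guessing.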
Although “battery” is the most common word, this analysis does not indicate whether users are speaking positively or negatively about the smartphone's battery, so further analysis would be needed to understand the users' sentiment.