Cloudy in the clouds

A word cloud is a popular and informative visualization for text processing. It renders the most frequent words larger and bolder, while less frequent words appear smaller. But let’s start from the beginning.

The article is split into several parts. This page is part 3. Other parts are written by Irma Spudiene. All parts are available here:
Part1. Solution overview
Part2. Are they similar app source code walk through
Part3. Cloudy in the clouds (this story)

How do you implement a word cloud in the cloud (feels cloudy in the cloud)? In this story, I describe the challenge I had implementing a word cloud on Google Cloud Platform. The general requirement was to present near-real-time data in a visually appealing, interactive word cloud. The visualization had to run for several hours.

I split this challenge into 2 sides:

  • backend

  • frontend

Let’s start from the beginning. Since the app and its image recognition were implemented in the previous parts, we had some cool pictures that we decided to process, label, and turn into a word cloud. For storing the labeled data we chose Google Firestore, which is fast and efficient. Once data arrives, it should be processed and made available in the UI for the end user. The processing is implemented in Cloud Functions, which keeps the solution serverless and allows event-based triggers on a Google Cloud Storage bucket. When an image arrives in the bucket, the function is triggered: Vision AI detects labels and their attributes, and the function writes them to the given Firestore collection.

from google.cloud import vision
from google.cloud import firestore
import secrets
import proto


def on_image_upload(event, context):
    # Triggered by a new object in the bucket; the event carries its name and bucket.
    file_name = event['name']
    file_bucket = event['bucket']

    # Ask Vision AI to label the uploaded image directly from GCS.
    client = vision.ImageAnnotatorClient()
    image = vision.Image()
    image.source.gcs_image_uri = f"gs://{file_bucket}/{file_name}"
    response = client.label_detection(image=image)
    labels = response.label_annotations

    # Store each label under a unique random key.
    image_labels = {}
    for label in labels:
        image_labels["label_" + secrets.token_hex(8)] = proto.Message.to_dict(label)

    # Write all labels of this image as one Firestore document.
    db = firestore.Client()
    db_ref = db.collection(u'atsapp_labels')
    db_ref.add({'labels': image_labels})

As I am not the biggest fan of frontend technologies, I started with the backend. The initial data source was a NoSQL database, Cloud Firestore. I started with the Python client for Firestore and the wordcloud library. All code was written in a Vertex AI Workbench managed JupyterLab instance. The idea was to take the data from Firestore, process it in a notebook, and upload the generated word cloud image to the target, which in my case was Google Cloud Storage (GCS from here on). How to apply this to near-real-time data? I used a simple cheat: a for loop with sleep. Reading the data and generating the word cloud image are repeated once per loop iteration, and the sleep time sets the interval between iterations.

import time
import datetime
import os
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from google.cloud import storage
import firebase_admin
from firebase_admin import credentials, firestore
from PIL import Image
import numpy as np
from pandas.io.json._normalize import nested_to_record
# firebase
cred = credentials.Certificate('auth/yourKey.json')
firebase_admin.initialize_app(cred)
db = firestore.client()
# gcs
client = storage.Client(project='yourProject')
bucket = client.get_bucket('yourBucket')
blobs = bucket.list_blobs(prefix='pics/')

# save
strFile = "my_plot.jpg"
# iterations
iterations = 7200
def firebase_reader():
    # Read every document and collect the label descriptions.
    users_ref = db.collection(u'atsapp_labels')
    docs = users_ref.stream()
    dict_temp = {}
    list_temp = []

    for doc in docs:
        attrlabels = doc.get('labels')
        # Flatten nested label dicts into keys like 'label_<token>_description'.
        flat = nested_to_record(attrlabels, sep='_')
        for key, value in flat.items():
            if 'description' in key:
                dict_temp[key] = value
                list_temp.append(dict_temp[key])
    return list_temp

def gcs_cleaner():
    # Delete all previously uploaded pictures under pics/.
    bucket = client.get_bucket('yourBucket')
    blobs = bucket.list_blobs(prefix='pics/')
    for blob in blobs:
        blob.delete()


def gcs_uploader():
    # Upload the freshly generated image, overwriting the previous one.
    blob = bucket.blob('pics/my_plot.jpg')
    blob.upload_from_filename('my_plot.jpg', content_type='image/jpeg')

# Regenerate and upload the word cloud every 2 seconds.
for count in range(iterations):
    time.sleep(2)
    list_of_dict_values = firebase_reader()
    str1 = ' '.join(list_of_dict_values)
    # The mask image defines the shape of the word cloud.
    mask = np.array(Image.open('comment.png'))
    word_cloud = WordCloud(mask=mask,
                           collocations=False,
                           background_color='white',
                           width=mask.shape[1],
                           height=mask.shape[0]).generate(str1)
    word_cloud.to_file(strFile)
    gcs_uploader()
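
The loop above joins all label descriptions into one string and lets WordCloud count the words. A minimal sketch of that counting step, using only the standard library, could look like this. `label_frequencies` is a hypothetical helper, not part of the original notebook, and the sample list is made up.

```python
from collections import Counter


def label_frequencies(descriptions):
    """Count how often each label description occurs (case-insensitive)."""
    cleaned = (d.strip().lower() for d in descriptions if d and d.strip())
    return Counter(cleaned)


# Example input shaped like the list returned by firebase_reader().
sample = ["Dog", "Street fashion", "dog", "Sky", "Dog", "street fashion"]
freqs = label_frequencies(sample)
print(freqs.most_common(2))
```

Counting whole labels keeps multi-word labels such as "Street fashion" together. The resulting mapping could be passed to `WordCloud.generate_from_frequencies` instead of calling `generate` on the joined string.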

Firestore structure:
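
Judging from the Cloud Function shown earlier, each document in `atsapp_labels` holds a single `labels` map. Its keys are `label_` plus a random hex token, and its values are the Vision API label annotations converted to dictionaries. The sketch below uses invented keys, descriptions, and scores (none of them come from the project) and shows how the `description` fields could be pulled out with plain Python, as an alternative to the pandas flattening helper. `extract_descriptions` is a hypothetical name.

```python
# Hypothetical document, shaped like the output of on_image_upload();
# all values are invented for illustration.
sample_doc = {
    "labels": {
        "label_3f9a1c0b7d2e4a68": {
            "description": "Dog",
            "score": 0.97,
            "topicality": 0.97,
        },
        "label_a1b2c3d4e5f60718": {
            "description": "Sky",
            "score": 0.88,
            "topicality": 0.88,
        },
    }
}


def extract_descriptions(doc):
    """Return the description of every label stored in one document."""
    labels = doc.get("labels", {})
    return [
        label["description"]
        for label in labels.values()
        if isinstance(label, dict) and label.get("description")
    ]


print(extract_descriptions(sample_doc))
```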

For the Firebase client, you need an auth key for a service account. The service account should have roles/firebase.admin and storage read/write primitive roles. The generated key file should be referenced in this line:

credentials.Certificate('auth/yourKey.json')

A word cloud can be drawn in any shape you like. In this case, a “comment” image was used as the mask for the word cloud:

For the frontend, I used a simple Flask application hosted on Google App Engine. The app reads the picture from GCS and renders it on the index page. The page includes a small HTML meta refresh tag that reloads it every 60 seconds. There is one small challenge: when the application reads the picture from the GCS bucket, the picture might be in the middle of being renewed. In that case a temporary “reloading” picture is shown instead.

from flask import Flask
import matplotlib.pyplot as plt
import io
import base64
from google.cloud import storage
from PIL import Image
from io import BytesIO

app = Flask(__name__)

# from_service_account_json expects the path to a service account key file
client = storage.Client.from_service_account_json('auth/yourKey.json')
bucket = client.get_bucket('yourBucket')

@app.route('/')
def build_plot():
    blob = bucket.get_blob('pics/my_plot.jpg')
    try:
        # Normal case: show the latest word cloud and refresh every 60 seconds.
        pic = blob.download_as_bytes()
        img = BytesIO(pic)
        reloader = str(60)
    except Exception:
        # Picture missing or being replaced: show the placeholder and retry after 1 second.
        blob = bucket.get_blob('static/reload.png')
        pic = blob.download_as_bytes()
        img = BytesIO(pic)
        reloader = str(1)

    plot_url = base64.b64encode(img.getvalue()).decode()
    ret = '<meta http-equiv="refresh" content="{}" /><p style="text-align:center;"><img src="data:image/png;base64,{}" width="700" height="700"></p>'.format(reloader, plot_url)

    return ret

if __name__ == '__main__':
    app.debug = True
    app.run()

Graphical representation of the concept:

If the app requests the picture while it is being overwritten, it serves the reload image instead of raising an error. This happens rarely, roughly once in about 30 iterations (with a 2-second sleep).
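
The fallback logic can be separated from Flask and GCS so that it is easy to test. The following is a minimal sketch, not the code that ran in the app: `load_image` and `render_page` are hypothetical helpers, and the two loader arguments stand in for calls such as `blob.download_as_bytes()`. It also picks the MIME type to match the picture, since the word cloud is a JPEG while the reload picture is a PNG.

```python
import base64


def load_image(load_primary, load_fallback):
    """Try the fresh picture first; fall back to the placeholder on any error.

    Returns (image_bytes, mime_type, refresh_seconds).
    """
    try:
        return load_primary(), "image/jpeg", 60
    except Exception:
        # Picture missing or mid-update: show the placeholder and retry soon.
        return load_fallback(), "image/png", 1


def render_page(image_bytes, mime_type, refresh_seconds):
    """Build the auto-refreshing HTML page with the image inlined as base64."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return (
        f'<meta http-equiv="refresh" content="{refresh_seconds}" />'
        f'<p style="text-align:center;">'
        f'<img src="data:{mime_type};base64,{encoded}" width="700" height="700">'
        f"</p>"
    )


def failing_loader():
    """Stand-in for a download that fails while the picture is replaced."""
    raise RuntimeError("picture is being replaced")


# Simulate a request that arrives while the picture is unavailable.
page = render_page(*load_image(failing_loader, lambda: b"placeholder"))
print(page[:45])
```

In the Flask view, the two loaders would wrap the GCS downloads, and the view would return the string produced by `render_page`.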

General overview (high-level diagram of solution architecture):

Challenge was completed with the final result:

There are multiple things that can be improved, like:

  • Use triggers to process the data, which would remove the need for the “cycle/sleep” approach
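
One way to read the trigger suggestion: instead of polling, a function fires whenever a new document is written to `atsapp_labels`. The sketch below assumes a 1st gen Cloud Function with a Firestore document-create trigger, where the new document arrives under `event["value"]["fields"]` in Firestore's typed-value encoding (`mapValue`, `stringValue`). `descriptions_from_event` and `on_labels_created` are hypothetical names, the payload is hand-written, and the step that would rebuild and upload the image is left as a callback.

```python
def descriptions_from_event(event):
    """Extract label descriptions from a Firestore trigger payload."""
    fields = event.get("value", {}).get("fields", {})
    labels = fields.get("labels", {}).get("mapValue", {}).get("fields", {})
    found = []
    for label in labels.values():
        label_fields = label.get("mapValue", {}).get("fields", {})
        text = label_fields.get("description", {}).get("stringValue")
        if text:
            found.append(text)
    return found


def on_labels_created(event, context, rebuild=print):
    """Entry point: runs once per new document, so no sleep loop is needed.

    `rebuild` is a placeholder for regenerating and uploading the word cloud.
    """
    descriptions = descriptions_from_event(event)
    if descriptions:
        rebuild(descriptions)
    return descriptions


def _label(text):
    """Build one label entry in the assumed typed-value shape."""
    return {"mapValue": {"fields": {"description": {"stringValue": text}}}}


# Hand-written payload in the assumed shape, for a local dry run.
fake_event = {
    "value": {
        "fields": {
            "labels": {
                "mapValue": {
                    "fields": {
                        "label_aaaa": _label("Dog"),
                        "label_bbbb": _label("Sky"),
                    }
                }
            }
        }
    }
}

on_labels_created(fake_event, context=None)
```

Each event carries only the newly written document, so `rebuild` would still need to read the whole collection or update a stored running count before drawing the cloud.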

