Pub/Sub: Send a million messages per second and save thousands of $ a month using Avro

Amir So
7 min read · Oct 11, 2021

Google Cloud Pub/Sub provides reliable, many-to-many, asynchronous messaging between applications. It supports schema-based messages using Avro and Protocol Buffers. Publisher applications send messages to a “topic”, and other applications subscribe to that topic to receive the messages in JSON or binary format. Now, once you get into gigabytes or terabytes of data, choosing the proper data format can significantly impact processing and cost. So what should we use: JSON, Avro, or Protocol Buffers?

Binary schema-driven formats like Protocol Buffers and Avro allow compact, efficient encoding with clearly defined forward and backward compatibility semantics. Their schemas are also useful for documentation and for code generation in statically typed languages.

Let’s take a brief look at each of these through examples, using a small message payload.
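For illustration, assume a small user record along these lines, roughly 96 bytes as JSON (the field names and values here are placeholders):

```json
{
  "first_name": "Amir",
  "last_name": "Soleimani",
  "email": "amir@example.com",
  "age": 30
}
```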

Imagine we have an application that generates 1M messages per second, all of the same size, which must be delivered to subscribers; this can be done with message batching. In this scenario, which data format is the most suitable, i.e. cost-efficient with good performance?

💡 A minimum of 1 KB per publish, push, or pull request is assessed regardless of message size. This means that for messages smaller than 1000 bytes, it is cheaper to batch multiple messages per request.

👆 Given this quote and the size of our message (96 bytes as JSON), we make sure our application batches multiple messages per request.
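For example, batching at least 11 messages per request (11 × 96 bytes ≈ 1,056 bytes) keeps every publish request above the 1 KB minimum.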

JSON

Obviously, JSON is a contender: widely known and supported. Its simple design and flexibility make it easy to read and understand and, in most cases, easy to manipulate in the programming language of your choice. However, do you think JSON is a suitable option in this case? I don’t think so!

1,000,000 × 96 bytes ≈ 96 MB/s; 96 MB/s × 86,400 s ≈ 8.2 TB/day ≈ 246 TB/month

The cost of this volume of data with Pub/Sub is approximately USD 9,934 per month. 🚑

Now, what if we use something like MessagePack, a binary encoding for JSON? The same message shrinks to 81 bytes (about 16% smaller).
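As a rough sketch of what that looks like in Go (the struct and values mirror the placeholder payload above, and the github.com/vmihailenco/msgpack/v5 package is just one of several MessagePack implementations you could use):

```go
package main

import (
	"encoding/json"
	"fmt"

	"github.com/vmihailenco/msgpack/v5"
)

// User mirrors the placeholder payload used throughout this post.
type User struct {
	FirstName string `json:"first_name" msgpack:"first_name"`
	LastName  string `json:"last_name" msgpack:"last_name"`
	Email     string `json:"email" msgpack:"email"`
	Age       int    `json:"age" msgpack:"age"`
}

func main() {
	u := User{FirstName: "Amir", LastName: "Soleimani", Email: "amir@example.com", Age: 30}

	jsonBytes, _ := json.Marshal(u)
	msgpackBytes, _ := msgpack.Marshal(u)

	// MessagePack keeps the field names but encodes lengths and numbers in binary,
	// so the payload is smaller than the equivalent JSON.
	fmt.Printf("json: %d bytes, msgpack: %d bytes\n", len(jsonBytes), len(msgpackBytes))
}
```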

1,000,000 × 81 bytes ≈ 81 MB/s; 81 MB/s × 86,400 s ≈ 7 TB/day

Likewise, the monthly cost drops to USD 8,494. Not bad, we saved some money, almost $1.5K 💰.

Protocol Buffer

Protocol Buffers, also known as Protobuf, is a high-performance, compact binary wire format invented by Google, which uses it internally to communicate between its network services at very high speed.

So, what does the message look like encoded with Protocol Buffers? Here is the hex representation of the bytes that contain the message’s data.

Protobuf — HEX representation of bytes — message encoded

You have probably noticed that only the type and value of each field are there. That’s because both the sender and the receiver have the Protobuf definition of this message. As long as the receiver has that definition, it is relatively easy to fill in the message fields by matching tags `<request.tag:message.tag>`.

Protobuf Message

`required` and `optional` make no difference to how a field is encoded. The difference is simply that `required` enables a runtime check that fails if the field is not set.
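Since the exact definition is not critical here, a plausible proto2 version of our placeholder payload could look like the following (field names and tag numbers are illustrative):

```proto
// user.proto (proto2 syntax, since required/optional only exist in proto2)
syntax = "proto2";

package example;

message User {
  required string first_name = 1;
  required string last_name  = 2;
  optional string email      = 3;
  optional int32  age        = 4;
}
```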

Well, the message size has been significantly reduced compared to the previous solution. Using Protobuf, the message is 61 bytes, roughly 36% smaller than JSON (96 bytes).

1,000,000 × 61 bytes ≈ 61 MB/s; 61 MB/s × 86,400 s ≈ 5.2 TB/day

The monthly cost has dropped to USD 6,334, which is $3.6K less than JSON 💰.

Avro

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

What does the message look like encoded with Avro? Here is the hex representation of the bytes that contain the message’s data.

Avro — HEX representation of bytes — message encoded

To parse the binary data, the decoder goes through the fields in the order they appear in the schema and uses the schema to determine each field’s data type. This means the binary data can only be decoded correctly if the code reading the data uses the exact same schema as the code that wrote it. Any mismatch between the reader’s and the writer’s schema means incorrectly decoded data. Unlike Protobuf, Avro doesn’t use tag numbers.

However, in some cases, an Avro reader can resolve differences in field order between the writer’s schema and the reader’s schema.

Avro reader resolves differences by field name and ignores the `age` field
Avro Schema
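For reference, an Avro schema for the placeholder payload might look like this (the record and field names are illustrative):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example",
  "fields": [
    {"name": "first_name", "type": "string"},
    {"name": "last_name", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}
```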

If we compare it with Protocol Buffers, the message size is reduced, but not by much. Using Avro, the message is 59 bytes, only 2 bytes less than the Protobuf-encoded message (61 bytes). Not a big difference, is it?

1,000,000 × 59 bytes ≈ 59 MB/s; 59 MB/s × 86,400 s ≈ 5 TB/day

The monthly cost was reduced to USD 6,094. Hmm… only about $240 less than Protobuf 🙄.

Avro vs Protobuf ⚔️

According to these calculations, their costs are not significantly different, so in this case schema evolution should also be considered.

Applications inevitably change over time. What about changing the datatype of a field, or removing a field? In most cases, a change to an application’s features also requires a change in the data that it stores.

In Pub/Sub, schema evolution is not possible; therefore, you have to create a new schema and a new topic for each version of your schema.

Supporting schema evolution is a fundamental requirement for a streaming platform, so our serialization mechanism also needs to support schema changes (or evolution). Both Protobuf and Avro offer mechanisms for updating a schema without breaking downstream consumers, which may still be using a previous schema version. However, you should check their documentation to understand which changes and data types they support.

With Protobuf, you can change the name of a field in the schema, since the encoded data never refers to field names, but you cannot change a field’s tag. You can add new fields to the schema by giving each one a unique tag number. Readers and consumers using the old schema won’t know anything about the new field (new tag number), so they will simply ignore it.
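For example, extending the illustrative proto2 message with a new field under a previously unused tag number is a safe change:

```proto
message User {
  required string first_name = 1;
  required string last_name  = 2;
  optional string email      = 3;
  optional int32  age        = 4;
  optional string country    = 5; // new field: old readers simply skip tag 5
}
```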

With Avro, if you add a field that has no default value, new readers (which have the new schema) won’t be able to read data written by old writers, so you break backward compatibility. Likewise, if you remove a field that has no default value, old readers won’t be able to read data written by new writers (which have the new schema), so you break forward compatibility.
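In practice, the safe way to evolve an Avro schema is therefore to only add or remove fields that have a default value, for example:

```json
{"name": "country", "type": "string", "default": "unknown"}
```

Old readers that don’t know about `country` simply ignore it, and new readers fill in the default when reading data written without it.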

👹 We have briefly covered the differences between Protobuf and Avro regarding schema changes. Of course, there are many other differences that you need to know and consider in your solution and schema design!

Pub/Sub and Avro

It’s time to try Pub/Sub with an Avro schema.

First of all, we need to create a new Avro schema (you can find the schema definition above).

Console->Pub/Sub->Schemas

Now, we are going to assign this schema to a new topic.

Console->Pub/Sub->Topics->Create Topic
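If you prefer the command line over the console, the gcloud CLI can do the same; something along these lines should work (schema, topic, and file names are placeholders, and it is worth checking `gcloud pubsub schemas create --help` for the exact flags):

```sh
# Create the Avro schema from a local definition file
gcloud pubsub schemas create user-schema \
  --type=avro \
  --definition-file=user.avsc

# Create a topic that validates messages against that schema,
# expecting the Avro binary encoding
gcloud pubsub topics create user-topic \
  --schema=user-schema \
  --message-encoding=binary
```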

You can only attach a schema to a topic at creation time; you cannot add a schema to an existing topic. After a schema is associated with a topic, you cannot update the schema or remove its association with that topic. If a schema is deleted, publishing to its associated topics will fail.

In the last step, I’m going to publish a message to this topic (I will use Golang). I do not use any Avro encoder/decoder library (e.g. LinkedIn’s goavro) in this example; it is better if you use one in your application.

Pub/Sub Publisher — Avro

The cost of batching is latency for individual messages, which are queued in memory until their corresponding batch is filled and ready to be sent over the network
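A minimal sketch of such a publisher, with batching enabled, might look like the following. It assumes the illustrative schema above, a topic created with binary message encoding, and hand-rolled Avro binary encoding (zig-zag varints for numbers, length-prefixed strings); the project and topic IDs are placeholders:

```go
package main

import (
	"context"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

// writeLong appends an Avro int/long: zig-zag encoded, then variable-length base-128.
func writeLong(buf []byte, n int64) []byte {
	u := uint64((n << 1) ^ (n >> 63)) // zig-zag encoding
	for u >= 0x80 {
		buf = append(buf, byte(u)|0x80)
		u >>= 7
	}
	return append(buf, byte(u))
}

// writeString appends an Avro string: its length as a long, then the UTF-8 bytes.
func writeString(buf []byte, s string) []byte {
	buf = writeLong(buf, int64(len(s)))
	return append(buf, s...)
}

// encodeUser hand-rolls the Avro binary encoding of the illustrative record:
// fields are written back to back, in the order they appear in the schema.
func encodeUser(firstName, lastName, email string, age int32) []byte {
	var buf []byte
	buf = writeString(buf, firstName)
	buf = writeString(buf, lastName)
	buf = writeString(buf, email)
	buf = writeLong(buf, int64(age))
	return buf
}

func main() {
	ctx := context.Background()

	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	topic := client.Topic("user-topic") // placeholder topic ID
	// Batch small messages so each publish request stays above the 1 KB minimum.
	topic.PublishSettings.CountThreshold = 100
	topic.PublishSettings.ByteThreshold = 10 * 1024
	topic.PublishSettings.DelayThreshold = 50 * time.Millisecond

	data := encodeUser("Amir", "Soleimani", "amir@example.com", 30)
	result := topic.Publish(ctx, &pubsub.Message{Data: data})

	// Get blocks until the batch containing this message has been sent.
	id, err := result.Get(ctx)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("published message %s", id)
}
```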

If you produce a message in the wrong format, you will receive something like this:

InvalidArgument desc = Invalid schema message: Message failed schema validation.

I pull the message manually and see the message body below.

Avro libraries (e.g. LinkedIn’s goavro) are responsible for decoding this body into an understandable format, i.e. the particular structure you defined earlier.
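For completeness, a sketch of the receiving side that decodes the body with goavro might look like this (project, subscription, and schema below are the same placeholders as before):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"cloud.google.com/go/pubsub"
	"github.com/linkedin/goavro/v2"
)

// Placeholder Avro schema; it must match the schema attached to the topic.
const userSchema = `{
  "type": "record",
  "name": "User",
  "namespace": "example",
  "fields": [
    {"name": "first_name", "type": "string"},
    {"name": "last_name", "type": "string"},
    {"name": "email", "type": "string"},
    {"name": "age", "type": "int"}
  ]
}`

func main() {
	ctx := context.Background()

	client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	codec, err := goavro.NewCodec(userSchema)
	if err != nil {
		log.Fatal(err)
	}

	sub := client.Subscription("user-topic-sub") // placeholder subscription ID
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		// Decode the Avro binary body into a native Go value (a map for records).
		native, _, err := codec.NativeFromBinary(m.Data)
		if err != nil {
			m.Nack()
			return
		}
		fmt.Printf("decoded message: %v\n", native)
		m.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}
}
```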

Features not available currently

  • Associating a schema with an existing topic
  • Updating a schema once it is created
  • Schema versioning
  • Schema evolution

Conclusion

💸 In conclusion, we have saved roughly $4K per month by choosing the Avro schema instead of the JSON format and following the batch-messaging approach. The calculations above are not the final monthly cost; there are certainly other costs to consider, such as internet egress and message delivery between Google Cloud regions.

Thanks!

References

Pub/Sub Pricing: https://cloud.google.com/pubsub/pricing

Pub/Sub Schemas: https://cloud.google.com/pubsub/docs/schemas

Batch Publisher: https://cloud.google.com/pubsub/docs/samples/pubsub-publisher-batch-settings

Pull Receiver: https://cloud.google.com/pubsub/docs/pull
