Serverless is key for voice apps; voice apps are serverless

Voice apps are serverless. This brings significant benefits to our development life cycle and to our development, testing, and production environments. Over the years, we have built so much on serverless that it has become a crucial part of our work. Hence, serverless is not just part of what we do, it is what we do.

Why are voice apps serverless?

Firstly, voice apps are serverless because voice-first solutions rely on complex voice interactions that need Speech Language Understanding (SLU), which covers Automated Speech Recognition (ASR), Natural Language Understanding (NLU), and Text to Speech (TTS) conversion. It is simply too complex to build all of these components yourself.

Secondly, it is preferable to rely on out-of-the-box services that scale up and down elastically to meet uncertain demand, since you can’t control how many users will make requests at the same time.

Finally, you want to focus on your voice application (how to make it engaging, how to improve and support it) and not on the infrastructure.

What are the components of a voice application solution?

A voice app solution includes:

  • User device (smart TV, smart assistants, mobile phones)
  • Speech Language Understanding (SLU) services: Automated Speech Recognition (ASR), Natural Language Understanding (NLU), and Text to Speech (TTS) conversion
  • A component that controls the user experience, including custom interaction models and intents
  • A back-end component, usually an API that can respond to queries
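To make the "custom interaction models and intents" component more concrete, here is a minimal sketch that assembles an interaction model in the JSON layout used by the Alexa Skills Kit. The invocation name "joke teller" and the custom intent TellJokeIntent are hypothetical names invented for this example; only the AMAZON.* built-in intents come from Alexa itself.

```python
import json
from typing import Dict, List


def build_interaction_model(invocation_name: str, intents: Dict[str, List[str]]) -> dict:
    """Assemble an interaction model dict in the Alexa Skills Kit JSON layout."""
    return {
        "interactionModel": {
            "languageModel": {
                "invocationName": invocation_name,
                "intents": [
                    {"name": name, "slots": [], "samples": samples}
                    for name, samples in intents.items()
                ],
            }
        }
    }


# Hypothetical skill: the invocation name and the custom intent are made up.
MODEL = build_interaction_model(
    "joke teller",
    {
        "TellJokeIntent": ["tell me a joke", "make me laugh"],
        # Built-in intents need no sample utterances of their own.
        "AMAZON.HelpIntent": [],
        "AMAZON.StopIntent": [],
        "AMAZON.CancelIntent": [],
    },
)

if __name__ == "__main__":
    # The printed JSON could be pasted into the developer console's JSON editor.
    print(json.dumps(MODEL, indent=2))
```

The sample utterances are what the NLU step matches against; the back-end only ever sees the resolved intent name.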

Voice app solution design

The following diagram describes a generic voice application solution design.

Generic voice application solution design
  1. Alexa users speak (or, on devices such as mobile phones, type) to ask for what they want, for instance, “Alexa, tell me a joke”.
  2. Alexa-enabled devices, such as smart TVs, assistants such as the Echo and Echo Dot, or mobile phones with the Alexa application installed, listen for a wake word and activate as soon as one is recognized.
  3. The Amazon Alexa Service performs common Speech Language Understanding (SLU) processing on behalf of the Alexa Skill, including Automated Speech Recognition (ASR), Natural Language Understanding (NLU), and Text to Speech (TTS) conversion.
  4. The Alexa Custom Skill, based on the Alexa Skills Kit, controls the user experience, including a custom interaction model, intents and Alexa Conversations. Within the Alexa Skills Kit, you can also develop Alexa Smart Home Skills to control IoT devices.
  5. The Skill Lambda function is the brains of the architecture. It processes the different types of requests sent from the Alexa Service and builds speech responses. Images and special audio effects can be stored in S3.
  6. DynamoDB (a NoSQL data store) persists user state, sessions, and any other required data.
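Step 5 can be sketched as a small Lambda handler that works directly on the JSON request envelope sent by the Alexa Service. This is an illustrative sketch, not the implementation behind the diagram: the intent name TellJokeIntent, the joke text, and the wording of the replies are assumptions, and a real skill would more likely use the ASK SDK and read user state from DynamoDB.

```python
from typing import Any, Dict

# Placeholder content for the hypothetical TellJokeIntent.
JOKE = "Why did the developer go broke? Because he used up all his cache."


def build_response(text: str, end_session: bool = True) -> Dict[str, Any]:
    """Wrap plain text in the Alexa response envelope."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Route an Alexa request by its type and, for intents, by intent name."""
    request = event.get("request", {})
    request_type = request.get("type")

    if request_type == "LaunchRequest":
        # The user opened the skill without asking for anything specific.
        return build_response("Welcome. Ask me for a joke.", end_session=False)

    if request_type == "IntentRequest":
        intent_name = request.get("intent", {}).get("name")
        if intent_name == "TellJokeIntent":
            return build_response(JOKE)
        if intent_name == "AMAZON.HelpIntent":
            return build_response("You can say: tell me a joke.", end_session=False)
        return build_response("Sorry, I did not understand that.", end_session=False)

    # SessionEndedRequest and anything else: reply without speech.
    return {"version": "1.0", "response": {}}
```

Calling `lambda_handler({"request": {"type": "IntentRequest", "intent": {"name": "TellJokeIntent"}}}, None)` returns an envelope whose `outputSpeech.text` is the joke, which the Alexa Service then converts to speech (step 3).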

Key integration points

Alexa Skills Endpoint

Alexa developer console screenshot

The above screenshot corresponds to point (4) of the solution design. The endpoint configured here receives a request whenever a user speaks (1) and routes it to your logic in the Lambda function (5). You could also implement your logic behind a separate endpoint (for example, a Google Cloud Function).

Lambda function and the Alexa skill destination

This is an important integration point: the function must allow invocations from the Alexa skill, otherwise the skill won’t work.
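One way to grant that permission outside the console is through the Lambda `add_permission` API, restricting invocation to a single skill ID. The sketch below assumes boto3 is available and configured with credentials; the function name, the skill ID, and the statement ID are hypothetical placeholders.

```python
from typing import Any, Dict, Optional


def alexa_invoke_permission(function_name: str, skill_id: str) -> Dict[str, str]:
    """Build add_permission arguments that let one Alexa custom skill invoke the function."""
    return {
        "FunctionName": function_name,
        "StatementId": "AllowAlexaSkillInvoke",  # hypothetical statement name
        "Action": "lambda:InvokeFunction",
        "Principal": "alexa-appkit.amazon.com",  # service principal for custom skills
        "EventSourceToken": skill_id,  # only this skill ID may invoke the function
    }


def grant_alexa_invoke(function_name: str, skill_id: str, client: Optional[Any] = None) -> Any:
    """Apply the permission; a client can be injected to avoid real AWS calls."""
    if client is None:
        import boto3  # imported lazily so the builder above works without boto3

        client = boto3.client("lambda")
    return client.add_permission(**alexa_invoke_permission(function_name, skill_id))


if __name__ == "__main__":
    # Placeholder values: replace with your own function name and skill ID.
    print(alexa_invoke_permission("my-skill-function", "amzn1.ask.skill.EXAMPLE"))
```

Tying the permission to the skill ID means other skills cannot invoke the function even if they know its ARN.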

Lambda function overview

The function overview lets you see the triggers, layers, and destinations for your function. Triggers are AWS services or resources that invoke the function (5), such as the Alexa Skills Kit (4). In this view, you can also configure destinations, which are AWS resources that receive a record of an invocation after success or failure. Layers are resources that contain libraries, a custom runtime, or other dependencies.

Conclusion

Voice apps are serverless. Serverless is a good fit because of reusability, which lets you focus on the engaging parts of the app; elasticity, which copes with uncertain workloads; and speed to market, which follows from the first two. A basic solution is easy to implement and has few integration points, which reduces complexity and errors. However, it can be harder to create a unique and engaging experience.
