
14. Applying Our Learnings: Alexa Skills Kit


One of book’s goals is to emphasize that the ideas, techniques, and skills introduced throughout apply to many types of applications. In this chapter, by creating a simple Alexa skill, we demonstrate how we can apply our knowledge of intent classification, entity extraction, and dialog construction to create a natural language voice experience. We begin by creating an Alexa skill in the simplest way possible, by using the Alexa Skills Kit SDK for Node.js. Since we already have a bot service back end, you may inevitably ask whether we can integrate Alexa with this back end. The answer is a resounding yes. Once we have our Alexa skill basics down, we will show how to power an Alexa skill via Direct Line and a Bot Framework bot.

Introduction

Alexa is Amazon’s intelligent personal assistant. The first Alexa-enabled devices were the Echo and Echo Dot followed by the screen-enabled Echo Show and Spot. Amazon is also exploring a chat bot platform called Lex. Alexa skills are developed by declaring a set of intents and slots (another name for entities) and writing a webhook to handle incoming Alexa messages. A message from Alexa will include the resolved intent and slot data. Our webhook responds with data that includes speech and user interface elements. In the first iteration of the Echo and Echo Dot, there was no physical screen, so the only user interface was the Alexa app on the user’s phone. The main user interface element on the app is a card, not much different from the hero cards we encountered in the Bot Builder SDK. For instance, a message from Alexa to our webhook will look as follows. Note that the message formats presented in this section are pseudocode because the actual messages are significantly more verbose.

{
    "id": "0000001",
    "session": "session00001",
    "type": "IntentRequest",
    "intent": {
        "intent": "QuoteIntent",
        "slots": [
            {
                "type": "SymbolSlot",
                "value": "apple"
            }
        ]
    }
}

The response would look like this:

{
    "speech": "The latest price for AAPL is 140.61",
    "card": {
        "title": "AAPL",
        "text": "The latest price for Apple (AAPL) is $140.61.",
        "img": "https://fakebot.ngrok.io/img/d5fa618b"
    }
}

We may want to allow additional functionality such as playing audio files. In keeping with the financial scenario, maybe we have audio briefing content that we would like to play for our users. A message to accomplish this task would look something like this:

{
    "speech": "",
    "directives": [
        {
            "type": "playAudio",
            "parameters": {
                "href": "https://fakebot.ngrok.io/audio/audiocontent1",
                "type": "audio/mpeg"
            }
        }
    ]
}

In addition, the system may want to provide an indication of whether the user cancelled audio playback or listened to the entire clip. More generically, the system may need a way to send events to our webhook. In those cases, an incoming message may look like this:

{
    "id": "0000003",
    "session": "session00001",
    "type": " AudioFinished"
}

If we have access to a screen, as the Echo Show device provides, the potential for more actions and behaviors grows. For example, we can now play videos. Or we can present a user interface with images and buttons to our users. If we display a list of items, perhaps we want the device to send an event when an item is tapped. To do this, we send a user interface render directive, so our earlier response for a quote might now include a user interface element as follows:

{
    "speech": "The latest price for AAPL is 140.61",
    "card": {
        "title": "AAPL",
        "text": "The latest price for Apple (AAPL) is $140.61.",
        "img": "https://fakebot.ngrok.io/img/d5fa618b"
    },
    "directives": [
        {
            "type": "render",            
            "template": "single_image_template",
            "param": {
                "title": "AAPL",
                "subtitle": "Apple Corp.",
                "img": "https://fakebot.ngrok.io/img/largequoteaapl"
            }
        }
    ]
}

The great thing about directives is that they are declarative; it is up to the device to determine what to do with them. The Echo Show and Echo Spot devices, for example, may render templates in a slightly different but consistent manner. The Echo and Echo Dot might ignore an unsupported directive, such as one to play a video, or raise an error.

Creating a New Skill

Creating a new Alexa skill requires having access to an Amazon developer account for skill registration and an Amazon Web Services (AWS) account to host the skill code. To get started, navigate to https://developer.amazon.com and click the Developer Console link. If you have an account, sign into it. Otherwise, click Create your Amazon Developer Account. We will be asked for an e-mail and a password, our contact information, and a developer or company name; we will also need to accept the app distribution agreement and to answer a couple of questions about whether our skill will accept payments or display ads. We can leave both answers selected as No to those last two questions. At this point, we will be taken to the dashboard (Figure 14-1).
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig1_HTML.jpg
Figure 14-1

Not much on this dashboard

Click the Alexa Skills Kit header item. We will now be placed in the Alexa Skills Kit Developer Console, with an empty list of skills. After clicking Create Skill, we must enter a skill name. After that, we must select a model to add to the skill. There are a few types of skills with prebuilt natural language models to choose from, but for this case we choose to build our own models, so we select the Custom skill.1 After selecting the Custom type, click the Create Skill button. We are now met with the skill dashboard (Figure 14-2). The dashboard includes the ability to create the skill’s language models, as well as configure, test, and even publish the skill.
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig2_HTML.jpg
Figure 14-2

New custom skill dashboard

There is a convenient Skill builder checklist area on the right side of the page that we will follow. We will begin by setting our skill’s invocation name. This is the phrase used to identify the skill when users want to invoke it on their Alexa device. For example, in the “Alexa, ask Finance Bot to quote Apple” utterance, Finance Bot is the invocation name. Clicking the Invocation Name checklist item loads the screen to set this up (Figure 14-3). After entering the name, click Save Model.
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig3_HTML.jpg
Figure 14-3

Setting up a skill invocation name

Before we jump into setting up our natural language model, or interaction model, we need to enable the right interfaces. Recall that we spoke about the ability to send directives to the device such as to play audio files or render a user interface element. We have to explicitly enable those features in our skill. Click the Interfaces link on the left-side navigation pane. Within this UI, enable Audio Player, Display Interface, and Video App (Figure 14-4). We will experiment with all of these in our chapter exercises.
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig4_HTML.jpg
Figure 14-4

Enabling Alexa interfaces

We are now ready to work on the Alexa interaction model.

Alexa NLU and Automatic Speech Recognition

You may have noticed that when we first created the skill, we had three built-in intents in our skill’s model. These are displayed on the left-side pane. After enabling the various interfaces, we now have about 16 intents. As the Alexa system adds more features, more and more intents will be added to all the skills.

This highlights the first difference between the Alexa interaction model and Language Understanding Intelligent Service (LUIS), explored in depth in Chapter 3. LUIS is a general-purpose natural language understanding (NLU) platform that can be utilized in just about any natural language application. Alexa is a specific ecosystem around digital assistant devices. To create a consistent experience across all Alexa skills, Amazon provides a set of common built-in intents for all skills prefixed by AMAZON. (Figure 14-5). For the best user experience, our skill should implement as many of these as possible or fail gracefully if they do not apply. Amazon will review all of these during the skill review process. As an aside, we do not cover skill review and certification in this book; Amazon provides ample detailed documentation around this process.
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig5_HTML.jpg
Figure 14-5

Built-in Alexa intents
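
To make this concrete, here is a brief sketch, written in the Node.js SDK handler style we use later in this chapter, of how a skill might handle a few of the common built-in intents gracefully; the response wording is our own, not anything prescribed by Amazon.

const builtInHandlers = {
    'AMAZON.HelpIntent': function () {
        // Offer a short hint about what the skill can do and keep the session open.
        this.emit(':ask', 'You can ask me for a quote or for account type information.',
            'What can I help you with?');
    },
    'AMAZON.StopIntent': function () {
        // Stop and Cancel should both end the interaction politely.
        this.emit(':tell', 'Ok. Bye.');
    },
    'AMAZON.CancelIntent': function () {
        this.emit(':tell', 'Ok. Bye.');
    }
};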

As if the set of 16 listed is not enough, Amazon provides a total of 133 built-in intents for our skills to take advantage of. It is useful for us to become familiar with the set provided by Amazon, as the list continues evolving independent of our skills. Of course, writing a custom skill implies adding custom intents. As we create a finance bot skill, we will create a quote intent that will allow us to get a quote either for a company or for a symbol. To add a new custom intent, click the Add button next to the Intents header on the left. Select the Create custom intent checkbox, enter the name, and click the Create custom intent button (Figure 14-6).
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig6_HTML.jpg
Figure 14-6

Adding the QuoteIntent custom intent

We are taken to the Intents screen where we can enter sample utterances (Figure 14-7). Note that the intent is added on the left-side pane and there is a trash button next to it should we choose to remove the intent from our model.
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig7_HTML.jpg
Figure 14-7

Populating sample utterances for QuoteIntent

Next, we need to be able to extract the name of the company or symbol that we want to get a quote for. In LUIS we would create a new entity for this purpose; in the Alexa world, this is known as a slot. We will create a custom slot type called QuoteItem and give it a few examples of company names or symbols. We first add a new slot type by clicking the Add button next to the Slot Types header in the left pane (Figure 14-8). Note that there are 96 built-in slot types! Those include everything from dates and numbers to actors, sports, and even video games. There is a Corporation slot type that could fit our purpose, but we choose to proceed with a custom slot type as an exercise. Select the Create custom slot type radio button, enter a name, and click the Create custom slot type button.
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig8_HTML.jpg
Figure 14-8

Adding a new slot type

Next, we enter the various values for the QuoteItem slot type (Figure 14-9).
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig9_HTML.jpg
Figure 14-9

Adding new values to a custom slot type

This is a limited set, of course, but it will do for now. The universe of company names and ticker symbols is quite large, and we are not expecting to enter all of them in the sample slot values. However, the more examples we provide, the better the NLU engine will be at correctly identifying QuoteItems, and the better the Automatic Speech Recognition (ASR) engine will be. The reason for this latter point is that speech recognition systems such as Alexa, Google Home, and Microsoft’s Cortana can all be primed with different utterances. Priming is an important step in the ASR process as it gives clear hints to the engine about the skill’s vocabulary. This allows the ASR system to understand context and better transcribe users’ utterances.

Let’s go back into the QuoteIntent. In Alexa’s NLU, we must explicitly add slot types to intents. Below the sample utterances, the intent user interface lets us add slots. Give the slot a name and click the + button. Now, we are able to assign a slot type (Figure 14-10).
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig10_HTML.jpg
Figure 14-10

Adding the QuoteItem slot type to QuoteIntent

Finally, we must correctly label the slot in each utterance. We can do this by selecting a word or set of consecutive words in the sample utterance interface. We will see a pop-up with the intent slots you can assign to the selected substring. After choosing QuoteItem for each one, our QuoteIntent will look like Figure 14-11.
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig11_HTML.jpg
Figure 14-11

The QuoteIntent is now ready

We will add one more intent. We want the ability to ask for information about specific account types using utterances like “get information for 401k account” or “what is a roth ira?” Let’s call this intent GetAccountTypeInformationIntent. Before we create the intent, let’s create the supporting slot type. In the same way that we added the QuoteItem slot type, let’s add an AccountType custom slot type.

Once it’s created, enter a set of different account types and different ways of expressing them. For example, 401k can also be referred to as 401(k). Note, we also specify the word spelling of each account type (Figure 14-12). The reason for this is that the ASR system may transcribe user input as words, not numbers. Note that the set of account types will most likely be a closed set for our application, so this presents a different use case from the open concept of a QuoteItem in our QuoteIntent.
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig12_HTML.jpg
Figure 14-12

Creating a custom slot type with synonyms

Now we can create a new custom intent called GetAccountTypeInformationIntent. Add the AccountType as an intent slot. Then we can enter some sample utterances. The result is found in Figure 14-13.
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig13_HTML.jpg
Figure 14-13

Finalized GetAccountTypeInformationIntent

At this point, we have finished the first draft of our interaction model. Click the Save Model button, followed by the Build Model button. Building the model will utilize all the data we have provided to train the system. Note that at any point we can see the model JSON format using the JSON Editor link in the left pane. The JSON encapsulates everything that was added to the model. Figure 14-14 shows an excerpt of it. The easiest way to share a model is to share this JSON content. Of course, there are also command-line tools to further automate this process.
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig14_HTML.jpg
Figure 14-14

An excerpt of the Alexa interaction model we just created
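
To give a flavor of that format, the following is an abridged, illustrative sketch of the kind of JSON the editor shows; the sample values are ours, and the schema Amazon generates carries additional fields.

{
    "interactionModel": {
        "languageModel": {
            "invocationName": "finance bot",
            "intents": [
                {
                    "name": "QuoteIntent",
                    "slots": [
                        { "name": "QuoteItem", "type": "QuoteItem" }
                    ],
                    "samples": [
                        "quote {QuoteItem}",
                        "get a quote for {QuoteItem}"
                    ]
                },
                { "name": "AMAZON.CancelIntent", "samples": [] }
            ],
            "types": [
                {
                    "name": "AccountType",
                    "values": [
                        {
                            "name": {
                                "value": "401k",
                                "synonyms": [ "401(k)", "four oh one k" ]
                            }
                        }
                    ]
                }
            ]
        }
    }
}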

For the purposes of this chapter, this is all we will cover about Alexa’s NLU. To be clear, we did not do it justice. The system is rich and worth learning about.

Diving Into Alexa Skills Kit for Node

Back in the dashboard, the last step in the Skill builder checklist is to set up the endpoint. The endpoint is the code that will receive the incoming messages from Amazon and respond with speech, cards, and directives.

There are two approaches we can take here. First, we can host an endpoint ourselves, give Amazon the URL, parse each request, and respond accordingly. Using this approach, we gain control but must implement the verification and parsing logic ourselves. We would also own the deployment tasks.

The second alternative, which is quite common these days, is to use serverless computing.2 This gives us the ability to create bits of code in the cloud that run and scale according to demand. On AWS, this is Lambda. In Azure, the equivalent would be Functions. Amazon provides the Amazon Alexa Skills Kit SDK for Node.js for this very purpose ( https://github.com/alexa/alexa-skills-kit-sdk-for-nodejs ). In this section, we dive into running Alexa Skills on AWS Lambda.

The structure of a skill built using the Alexa Skills Kit SDK is shown next. We register all the intents we want to handle in the code. The emit function sends responses to Alexa. There are many different overloads of emit documented on the SDK’s GitHub site.3

const handlers = {
    'LaunchRequest': function () {
        this.emit('HelloWorldIntent');
    },
    'HelloWorldIntent': function () {
        this.emit(':tell', 'Hello World!');
    }
};

Finally, we register the skill and handlers with the Alexa SDK.

const Alexa = require('alexa-sdk');
exports.handler = function(event, context, callback) {
    const alexa = Alexa.handler(event, context, callback);
    alexa.registerHandlers(handlers);
    alexa.execute();
};

This code is sufficient to run a basic skill that responds with “hello world” when launched or when the HelloWorldIntent intent is matched. Conceptually, we will follow the same approach when creating the code for our financial skill. Before we continue, though, how do we connect our skill to an AWS Lambda?

First, we will need to have an AWS account. We can create an AWS free tier account here: https://aws.amazon.com/free/ . The free tier is a perfect way to get started and become familiar with AWS. Click Create Free Account. We will be asked for an e-mail address, a password, and an AWS account name (Figure 14-15).
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig15_HTML.jpg
Figure 14-15

Creating a new AWS account

Next, we will enter our personal contact information. We will need to enter our payment information for identity verification purposes (you will not be charged while in the free tier) and verify our phone number. Once completed, we will be taken to the AWS Management Console. At this point, we can find Lambda in the “All services” list and navigate to it.

Now we can start creating a Lambda function. Click “Create a function,” select Blueprints, find and select the alexa-skill-kit-sdk-factskill, and click the Configure button. We give the function a name unique to our account’s function list, set Role to Create new role from template(s), give the role a name, and select the Simple Microservice permissions template (Figure 14-16).
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig16_HTML.jpg
Figure 14-16

Creating a new Lambda function

Below the data entry fields, we will see our Lambda code. The runtime should be set to Node.js 6.10, though it is safe to assume Amazon may update this any time. We leave the code as is for now. After clicking the Create Function button, you will be taken to the function configuration screen (Figure 14-17).
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig17_HTML.jpg
Figure 14-17

Function configuration screen

There are many actions we can perform on this screen. First, the top right shows the Lambda identifier. We will need to present this to the Alexa skill momentarily. We also see that the function has access to CloudWatch logs (all Lambda logs are sent to CloudWatch) and DynamoDB, Amazon’s managed cloud NoSQL database. Alexa skills can use DynamoDB to store skill state.

In the Designer section, we need to set a trigger that can invoke our new function. For our purposes, find and click the Alexa Skills Kit trigger. Once you do so, a Configure Triggers section will appear below. Enter the skill ID from the Alexa Skill dashboard. It should look like amzn1.ask.skill.5d364108-7906-4612-a465-9f560b0bc16f. Once you have entered the ID, click Add for the trigger and then save the function configuration. At this point, the Lambda function is ready to be called from our skill.

Before we do so, we select the function in the Designer (in this case, srozga-finance-skill-function as per Figure 14-17); we will be greeted with the code editor. We have a few different options of how code is loaded into Lambda. One option is to write the code manually in the editor; another option is to upload a zip with all the code. Doing this manual labor in a real application gets tiring very quickly; you can utilize the AWS CLI4 and the ASK CLI5 to deploy a skill from the command line. For now, we will simply use the editor. Replace the code in the editor with the following:

'use strict';
const Alexa = require('alexa-sdk');
const handlers = {
    'LaunchRequest': function () {
        this.emit(':tell', 'Welcome!');
    },
    'QuoteIntent': function () {
        this.emit(':tell', 'Quote by company.');
    },
    'GetAccountTypeInformationIntent': function () {
        this.emit(':tell', 'Getting account type.');
    }
};
exports.handler = function (event, context, callback) {
    const alexa = Alexa.handler(event, context, callback);
    alexa.registerHandlers(handlers);
    alexa.execute();
};

Before we leave, copy the Lambda function’s Amazon Resource Name (ARN) from the top-right area of the screen. The identifier looks like this: arn:aws:lambda:us-east-1:526347705809:function:srozga-finance-skill-function.

Let’s switch back into the Alexa Skill configuration screen for our skill. Select the Endpoint link in the right-side pane. Select the AWS Lambda ARN checkbox and enter the Lambda ARN in the Default Region text box (Figure 14-18).
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig18_HTML.jpg
Figure 14-18

Alexa skill Lambda ARN endpoint configuration

Click the Save Endpoints button. If there are issues here, you may not have correctly added the Alexa Skills Kit trigger for the Lambda function.

At this point we can navigate into the Test section, using the top navigation panel. By default, the skill is not enabled for test. Toggle the checkbox. Now, we can test the skill from the Alexa test interface, any Echo device connected to the developer account, or third-party tools such as EchoSim.6 You may be prompted to allow microphone access if you want to speak to the test application.

We can send input utterances by either speaking or typing, and we will receive our lambda function’s response, as shown in Figure 14-19. Make sure to preface your utterances with “Ask {Invocation Name}.” Note that this interface presents the raw input and output JSON content. Take some time to examine it; it contains a lot of information we covered earlier in the chapter. For example, the incoming request includes the resolved intent and slots from our interaction model. The output contains SSML for the Echo device to speak. The output also indicates that the session should end. We will dive a bit deeper into sessions later.
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig19_HTML.jpg
Figure 14-19

Success!
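
For reference, the response JSON that the :tell emit produces has roughly the following shape; it is abridged here, and the exact fields the SDK emits are more numerous.

{
    "version": "1.0",
    "sessionAttributes": {},
    "response": {
        "outputSpeech": {
            "type": "SSML",
            "ssml": "<speak> Welcome! </speak>"
        },
        "shouldEndSession": true
    }
}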

Now that we see the incoming JSON and the slot format, we can extend the code to extract the slot values. In the context of an intent handler, the this.event.request object contains the resolved intent and slot values. From there, it’s simply a matter of extracting the values and doing something with them. The following code extracts the slot values and includes them in the Alexa voice response:

'use strict';
const Alexa = require('alexa-sdk');
const handlers = {
    'LaunchRequest': function () {
        this.emit(':tell', 'Welcome!');
    },
    'QuoteIntent': function () {
        console.log(JSON.stringify(this.event));
        let intent = this.event.request.intent;
        let quoteitem = intent.slots['QuoteItem'].value;
        this.emit(':tell', 'Quote for ' + quoteitem);
    },
    'GetAccountTypeInformationIntent': function () {
        console.log(JSON.stringify(this.event));
        let intent = this.event.request.intent;
        let accountType = intent.slots['AccountType'].value;
        this.emit(':tell', 'Getting information for account type ' + accountType);
    }
};
exports.handler = function (event, context, callback) {
    const alexa = Alexa.handler(event, context, callback);
    alexa.registerHandlers(handlers);
    alexa.execute();
};

A sample interaction with input “ask finance bot what is an ira” is presented in Figure 14-20. If you speak the utterance, it will come through as “ask finance bot what is an I R A.” Make sure “I R A” is one of the synonyms for the IRA value in the AccountType slot type.
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig20_HTML.jpg
Figure 14-20

Successfully extracting AccountType slot values from Alexa request

Note that if we send the skill something that the built-in Amazon intents should handle, such as “cancel,” the skill might return an error. The reason for this is that we do not yet handle some of those built-in intents. In addition, we do not include unhandled intent logic. We can easily handle both cases by adding the following handlers:

    'AMAZON.CancelIntent': function() {
        this.emit(':tell', 'Ok. Bye.');
    },
    'Unhandled': function() {
        this.emit(':tell', "I'm not sure what you are talking about.");
    }

Now, telling the skill “cancel” results in a good-bye message (Figure 14-21).
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig21_HTML.jpg
Figure 14-21

The sassy message we promised when asking the skill to cancel

Great. This works well, but how do we model a dialog in an Alexa skill? The SDK for Node.js includes the concept of state. Think of it as the user’s current dialog. For each state, we provide a set of handlers for each intent supported by that state. Essentially, we are encoding a dialog graph by using a set of state names and handlers. The code for this skill follows:

'use strict';
const Alexa = require('alexa-sdk');
const defaultHandlers = {
    'LaunchRequest': function () {
        this.emit(':ask', 'Welcome to finance skill!  I can get you information about quotes or account types.', 'What can I help you with?');
    },
    'GetAccountTypeInformationIntent': function () {
        this.handler.state = 'AccountInfo';
        this.emitWithState(this.event.request.intent.name);
    },
    'QuoteIntent': function () {
        this.handler.state = 'Quote';
        this.emitWithState(this.event.request.intent.name);
    },
    'AMAZON.CancelIntent': function () {
        this.emit(':tell', 'Ok. Bye.');
    },
    'Unhandled': function () {
        console.log(JSON.stringify(this.event));
        this.emit(':ask', "I'm not sure what you are talking about.", 'What can I help you with?');
    }
};
const quoteStateHandlers = Alexa.CreateStateHandler('Quote', {
    'LaunchRequest': function () {
        this.handler.state = '';
        this.emitWithState('LaunchRequest');
    },
    'AMAZON.MoreIntent': function () {
        this.emit(':ask', 'More information for quote item ' + this.attributes.quoteitem, 'What else can I help you with?');
    },
    'AMAZON.CancelIntent': function () {
        this.handler.state = '';
        this.emitWithState(this.event.request.intent.name);
    },
    'QuoteIntent': function () {
        console.log(JSON.stringify(this.event));
        let intent = this.event.request.intent;
        let quoteitem = null;
        if (intent && intent.slots.QuoteItem) {
            quoteitem = intent.slots.QuoteItem.value;
        } else {
            quoteitem = this.attributes.quoteitem;
        }
        this.attributes.quoteitem = quoteitem;
        this.emit(':ask', 'Quote for ' + quoteitem, 'What else can I help you with?');
    },
    'GetAccountTypeInformationIntent': function () {
        this.handler.state = '';
        this.emitWithState(this.event.request.intent.name);
    },
    'Unhandled': function () {
        console.log(JSON.stringify(this.event));
        this.emit(':ask', "I'm not sure what you are talking about.", 'What can I help you with?');
    }
});
const accountInfoStateHandlers = Alexa.CreateStateHandler('AccountInfo', {
    'LaunchRequest': function () {
        this.handler.state = '';
        this.emitWithState('LaunchRequest');
    },
    'AMAZON.MoreIntent': function () {
        this.emit(':ask', 'More information for account ' + this.attributes.accounttype, 'What else can I help you with?');
    },
    'AMAZON.CancelIntent': function () {
        this.handler.state = '';
        this.emitWithState(this.event.request.intent.name);
    },
    'GetAccountTypeInformationIntent': function () {
        console.log(JSON.stringify(this.event));
        let intent = this.event.request.intent;
        let accounttype = null;
        if (intent && intent.slots.AccountType) {
            accounttype = intent.slots.AccountType.value;
        } else {
            accounttype = this.attributes.accounttype;
        }
        this.attributes.accounttype = accounttype;
        this.emit(':ask', 'Information for ' + accounttype, 'What else can I help you with?');
    },
    'QuoteIntent': function () {
        this.handler.state = '';
        this.emitWithState(this.event.request.intent.name);
    },
    'Unhandled': function () {
        console.log(JSON.stringify(this.event));
        this.emit(':ask', "I'm not sure what you are talking about.", 'What can I help you with?');
    }
});
exports.handler = function (event, context, callback) {
    const alexa = Alexa.handler(event, context, callback);
    alexa.registerHandlers(defaultHandlers, quoteStateHandlers, accountInfoStateHandlers);
    alexa.execute();
};

Note that this skill has two states: Quote and AccountInfo. Within the context of these states, each intent may produce different behavior. If a user asks about an account in the Quote state, the skill redirects to the default state to decide what to do with the request. Likewise, if a user asks about a quote in the AccountInfo state, similar logic happens. An illustration of the dialogs is presented in Figure 14-22. Note that in the code, we use this.emit(':ask') if we want to keep the session open and this.emit(':tell') if we simply want to speak an answer and close the session. If the session stays open, we do not have to preface each utterance to Alexa with “ask finance bot.” It is implicit since the session between the user and our skill stays open.7 There is another way to build responses by utilizing the ResponseBuilder. We can read about it in the SDK documentation, and we will use it in Exercise 14-1 to build responses with render template directives.
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig22_HTML.jpg
Figure 14-22

An illustration of the dialogs and transitions in our skill

Go ahead and run this sample to gain familiarity with the ideas behind the flow. Importantly, we take advantage of two fields for state storage: this.handler.state, which holds the name of the current state, and this.attributes, which acts as a user conversation data store. Think of this.attributes as the privateConversationData dictionary in Bot Builder. These values are not persisted when a session ends by default, but the Alexa Skills Kit SDK for Node.js supports DynamoDB integration for state storage. This would enable our skill to continue an interaction with a user whenever they invoke the skill again.
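
Enabling that persistence is a one-line change as far as the SDK is concerned: we assign a DynamoDB table name on the handler before executing it. The table name below is our own choice; per the SDK documentation, the table is created on first use provided the function’s role has the necessary DynamoDB permissions.

exports.handler = function (event, context, callback) {
    const alexa = Alexa.handler(event, context, callback);
    // Persist this.attributes across sessions in DynamoDB.
    // 'FinanceSkillState' is an illustrative table name.
    alexa.dynamoDBTableName = 'FinanceSkillState';
    alexa.registerHandlers(defaultHandlers, quoteStateHandlers, accountInfoStateHandlers);
    alexa.execute();
};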

Other Options

We conveniently ignored a few other options along the way. The skill developer console for our skill contains the Account Linking and Permissions links. Account linking is the process of redirecting the user to an authorization experience via an OAuth flow managed by Alexa. Alexa stores the tokens and sends them to our endpoint as part of each request. Part of the reason this is managed in this manner is that the original Echo did not have a screen. As an affordance, authorization is conducted through the Alexa mobile app, so the Alexa servers need to own the entire OAuth flow.
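
When a user has linked an account, the resulting access token arrives on the incoming request; a handler can read it from the session’s user object, roughly as sketched below. The account-linking card emit shown is the SDK’s way of prompting the user to link via the Alexa app; treat the exact shape as an assumption to verify against the SDK version in use.

    'QuoteIntent': function () {
        // With account linking configured, Alexa includes the OAuth access token
        // on each request once the user has linked an account.
        const accessToken = this.event.session.user.accessToken;
        if (!accessToken) {
            // Prompt the user to link their account through the Alexa app.
            this.emit(':tellWithLinkAccountCard', 'Please link your account in the Alexa app.');
            return;
        }
        // ...call our back-end API with accessToken...
    },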

The Permissions screen lets us request access to certain data on the user’s device such as the device address or Alexa shopping lists (Figure 14-23).

You can find more information on both topics in the Alexa documentation.8
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig23_HTML.jpg
Figure 14-23

The Alexa Permissions screen

Exercise 14-1

Connecting to Real Data and Rendering Imagery

In Chapter 11 we integrated with a service called Intrinio to fetch financial data and render it in an image. The goal of this exercise is to connect your Alexa Skill code to the same service and render the image on screen-enabled Echo devices.
  1. Use the code in the previous section as a starting point. Revisit the code from Chapter 11 and ensure that your quote state QuoteIntent handler retrieves quote data from Intrinio and responds with the latest price in voice.

  2. Integrate Chapter 11’s HTML-to-image generation code into your Alexa skill. Remember to add the necessary packages into the package.json file in the Lambda function.

  3. Visit https://developer.amazon.com/docs/custom-skills/display-interface-reference.html to get familiar with how to render display templates. Specifically, you will be using BodyTemplate7 to render the image generated in the previous step.

  4. To render the template using the Node.js SDK for Alexa Skills Kit, you will need to utilize the response builder ( https://github.com/alexa/alexa-skills-kit-sdk-for-nodejs#response-vs-responsebuilder ). The SDK has helpers to generate the template JSON ( https://github.com/alexa/alexa-skills-kit-sdk-for-nodejs#display-interface ).

  5. Test the functionality in the Alexa Test utility, EchoSim, and, if available, real Echo devices. What is the behavior of the code in a device without a display?

Your skill should now be rendering your financial quote image on display-enabled Echo devices, and you should have gained hands-on experience testing an Alexa skill using several methods.

Connecting to Bot Framework

The features we have presented thus far are just a fraction of the Alexa Skills Kit capabilities but are sufficient to gain an appreciation for applying this book’s concepts to emerging voice platforms. The process of connecting an Alexa skill to a Bot Framework bot follows a recipe similar to our voice bot implementation for Twilio in Chapter 8. We will show how to accomplish this connection in code, given our existing Alexa Skills Kit interaction model. Before we dive into the code, we will discuss several implementation decisions for our solution.

Implementation Decisions Around Bot Framework and Alexa Skills Kit Integration

Typically, we do not suggest that a stand-alone Alexa skill be implemented by using the Bot Framework. If the requirements truly suggest a single platform, staying within the confines of an Alexa interaction model and the Alexa Skills Kit SDK for Node.js running on an AWS Lambda function is sufficient. In the case that our product should support multiple natural language text and voice interfaces, we may want to consider one platform to run our business logic, and the Bot Framework lends itself well to this approach. Once we start down the path of connecting an Alexa skill to the Bot Framework, several important implementation decisions follow. These apply to all types of systems, not just Alexa.

Natural Language Understanding

In the context of our current effort, which NLU platform should we utilize: LUIS or Alexa’s interaction model? If we were to use Alexa’s interaction model, we would have to pass the Alexa intent and slot objects through Direct Line calls into our bot implementation. We could then build a custom recognizer that detects this object’s existence and translates it to the correct intent and entity response object in the Bot Builder SDK. To make it very clear, this is where the utility of recognizers shines: the bot doesn’t care where the intent data comes from.

On the other hand, if we choose to utilize LUIS, we must find a way to pass raw input from Alexa into the bot. The way to achieve this is to mark the entire user input as an AMAZON.LITERAL slot type.9 This allows developers to pass the raw user input into the skill code. This does not mean our skill interaction model becomes nonexistent. Remember, Alexa uses the interaction model for its ASR, so we want to give as many examples of utterances and input types that we expect in our skill’s vocabulary. We would need to include all our LUIS utterances in the Alexa interaction model.

In general, since the bot may support more channels than Alexa, consolidating on one NLU system, such as LUIS, is the more maintainable approach. There is no way to break away from the Alexa interaction model completely, though. We still need to ensure our bot correctly handles the built-in intents, such as Stop and Cancel. In the following code sample, in the interest of expediency, we will assume the entire NLU model lives in Alexa and demonstrate a custom recognizer approach.

Channel-Agnostic vs. Channel-Specific Dialogs

When we develop one bot that handles multiple channels, we must decide whether one dialog implementation can handle all channels or whether each channel should have its own dialog implementation. There are arguments to be made for each, although if we think in terms of the Model View Controller (MVC) pattern,10 we can come up with an elegant solution. If we consider a dialog to be the controller and the APIs we talk to as the model, then we are left with the question of what takes on the role of the view.

We want to create separate pieces of code that can render messages based on the channel. Although the bot service attempts to abstract the channel, we will run into channel-specific behavior at one point or another. For example, we will treat Alexa differently from a text channel. One approach is to create a default view renderer that is used in the dialog with the addition of channel-specific view renderers to support behavior or imagery that diverges from the default. A more generic approach is to simply have different view renderers for voice versus text channels. Figure 14-24 shows a sample flow of this approach in the case of a message from a voice channel.
../images/455925_1_En_14_Chapter/455925_1_En_14_Fig24_HTML.jpg
Figure 14-24

A sample flow of a message incoming from a voice channel such as Alexa and its flow through our system all the way to the view renderers
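
A hypothetical sketch of the renderer selection the figure describes might look like the following; none of these module or function names come from the SDK, they are purely illustrative.

// Purely illustrative: map a channel ID to a view renderer module.
const renderers = {
    'default': require('./views/defaultQuoteRenderer'),   // text channels
    'directline': require('./views/alexaQuoteRenderer')    // voice via our connector
};
function renderQuote(session, quote) {
    const channelId = session.message.address.channelId;
    const renderer = renderers[channelId] || renderers['default'];
    // Each renderer returns a builder.Message tailored to its channel.
    session.send(renderer.quoteMessage(session, quote));
}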

Alexa Constructs

The Bot Builder SDK abstracts the concept of a text conversation well, but mapping the concepts directly to Alexa is nontrivial. A couple of examples come to mind.

First, when a speech utterance is sent to the Alexa service, it may include an initial speech string plus a re-prompt speech string. The re-prompt is spoken to the user if Alexa poses a question, and the user does not respond in time. Bot Builder activities contain a property for speech but not for re-prompt. In our sample code, we leverage the custom Channel Data field to send this information.

A second example is the Alexa render templates. Although we are not covering them here, Alexa supports a number (seven by the latest count) of templates to display content on display-enabled Echo devices. Each template is a different JSON structure representing a user interface. Although we could try to come up with a way to utilize the hero card objects to communicate these templates to a connector, it is simpler to generate the JSON in a renderer and send in the channel data. Instructing the Echo device to play a video presents a similar dilemma.

A solution to all these problems is to try to render as much as possible using the Bot Builder SDK objects and drop to channel data only when necessary. As illustrated in Figure 14-24, we could even utilize the Bot Builder SDK objects and translate them to channel-specific constructs in the connector layer. In general, though, it is easier to generate the Alexa channel data for each response in an Alexa renderer.
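
Sticking with that approach, a renderer could attach a template payload to the Direct Line channel data under a field of our own choosing and let the connector translate it into an Alexa Display.RenderTemplate directive. The renderTemplate field name and the abridged template contents below are our own convention, not part of either SDK.

let msg = new builder.Message(session)
    .text(response)
    .speak(response)
    .sourceEvent({
        directline: {
            keepSessionOpen: true,
            // Our own convention: the connector copies this object into the
            // Alexa response as a Display.RenderTemplate directive.
            renderTemplate: {
                type: 'BodyTemplate7',
                token: 'quote',
                title: 'AAPL',
                image: { sources: [{ url: 'https://fakebot.ngrok.io/img/largequoteaapl' }] }
            }
        }
    });
session.send(msg);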

Callback Support

Most channels can send events that have nothing to do with user messages. For example, Facebook sends events about referrals, app handover, checkouts, and payments among others. These are channel-specific messages that need to be handled in the bot, sometimes outside the structure of a dialog. Alexa is no stranger to such events. When a video or audio file is playing on an Echo device, various events about progress, interruptions, and errors are sent to the skill. It is up to our bot code to interpret those events correctly.

A good approach to this interaction is to create custom recognizers that can identify the different types of messages and then direct these messages to the right dialogs. For events that require a JSON response, the dialogs should send a payload using the channel data.
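
For example, a custom recognizer might surface AudioPlayer callbacks as intents that a dedicated dialog can react to. The sketch below assumes the same pass-through channel data format we use in the sample integration later in this chapter.

// Sketch: map AudioPlayer callbacks (e.g., AudioPlayer.PlaybackFinished)
// to intents so a dialog can respond to playback events.
exports.audioEventRecognizer = {
    recognize: function (context, done) {
        const msg = context.message;
        const alexaMessage = msg.sourceEvent &&
            msg.sourceEvent.directline &&
            msg.sourceEvent.directline.alexaMessage;
        if (alexaMessage && alexaMessage.request.type.indexOf('AudioPlayer.') === 0) {
            done(null, { intent: alexaMessage.request.type, score: 1.0 });
            return;
        }
        done(null, { score: 0 });
    }
};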

Sample Integration

Let’s dig into what a sample integration would look like. We split the implementation into three components: the connector, the recognizer, and the bot. The full sample code can be found under the chapter14-alexa-skill-connector-bot folder in the book’s GitHub repo.

The connector consists of an HTTP handler that Alexa will send messages to. The goal of the handler is to resolve the conversation, call the bot, wait for a response from the bot, and send the message back to Alexa. There is a bit of code here, so let’s walk through it step-by-step.

The message comes into the handler. We extract the request body and the user ID. We then create an MD5 hash of the user ID. The reason for doing this is that Alexa user IDs are longer than the Bot Framework supports. A hash helps us keep the length manageable.

const cachedConversations = {};
exports.handler = function (req, res, next) {
    const reqContents = req.body;
    console.log('Incoming message', reqContents);
    const userId = reqContents.session.user.userId;
    const userIdHash = md5(userId);
    ...
};

We next either retrieve a cached conversation for that user or create a new one. Note that we store the conversations in memory, so every server restart will create new Direct Line conversations. In production, we would use a persistent store such as Cosmos DB or Azure Table Storage. Alexa also includes a flag that informs us whether a session has just started. In the case that we do not have a cached conversation or the session is new, we create a new Direct Line conversation and cache it.

const cachedConv = cachedConversations[userId];
let p = Promise.resolve(cachedConv);
if (reqContents.session.new || !cachedConv) {
    p = startConversation(process.env.DL_KEY).then(conv => {
        cachedConversations[userId] = { id: conv.conversationId, watermark: null, lastAccessed: moment().format() };
        console.log('created conversation [%s] for user [%s] hash [%s]', conv.conversationId, userId, userIdHash);
        return cachedConversations[userId];
    });
}
p.then(conv => {
    ...
});

After we retrieve the conversation, we post an activity to the bot. Note that since we decided to pass the resolved Alexa interaction model intents and slots, we simply pass the Alexa message through the channel data in the sourceEvent property.

postActivity(process.env.DL_KEY, conv.id, {
    from: { id: userIdHash, name: userIdHash }, // required (from.name is optional)        
    type: 'message',
    text: '',
    sourceEvent: {
        'directline': {
            alexaMessage: reqContents
        }
    }
}).then(() => {
    ...
});

If Alexa sends a SessionEndedRequest, we automatically respond with an HTTP 200 status code.

if (reqContents.request.type === 'SessionEndedRequest') {
    buildAndSendSessionEnd(req, res, next);
    return;
}
function buildAndSendSessionEnd(req, res, next) {
    let responseJson =
        {
            "version": "1.0"
        };
    res.send(200, responseJson);
    next();
}

Otherwise, we use the Direct Line polling mechanism to try to get the activity response from the bot. We time out after ten seconds. Once a response activity has been identified, we extract some Alexa-specific information from the activity and build a response to Alexa. If the request times out, we send back an HTTP 504 status code.

let timeoutAttempts = 0;
const intervalSleep = 500;
const timeoutInMs = 10000;
const maxTimeouts = timeoutInMs / intervalSleep;
const interval = setInterval(() => {
    getActivities(process.env.DL_KEY, conv.id, conv.watermark).then(activitiesResponse => {
        const temp = _.filter(activitiesResponse.activities, (m) => m.from.id !== userIdHash);
        if (temp.length > 0) {
            clearInterval(interval);
            const responseActivity = temp[0];
            console.log('Bot response:', responseActivity);
            conv.watermark = activitiesResponse.watermark;
            conv.lastAccessed = moment().format();
            const keepSessionOpen = responseActivity.channelData && responseActivity.channelData.keepSessionOpen;
            const reprompt = responseActivity.channelData && responseActivity.channelData.reprompt;
            buildAndSendSpeech(responseActivity.speak, keepSessionOpen, reprompt, req, res, next);
            // The response has been sent; skip the timeout bookkeeping below.
            return;
        } else {
            // no-op
        }
        timeoutAttempts++;
        if (timeoutAttempts >= maxTimeouts) {
            clearInterval(interval);
            buildTimeoutResponse(req, res, next);
        }
    });
}, intervalSleep);

That’s it! The code to build the response messages follows.

function buildTimeoutResponse(req, res, next) {
    res.send(504);
    next();
}
function buildAndSendSpeech(speak, keepSessionOpen, reprompt, req, res, next) {
    let responseJson =
        {
            "version": "1.0",
            "response": {
                "outputSpeech": {
                    "type": "PlainText",
                    "text": speak
                },
                // TODO REPROMPT
                "shouldEndSession": !keepSessionOpen
            }
        };
    if (reprompt) {
        responseJson.reprompt = {
            outputSpeech: {
                type: 'PlainText',
                text: reprompt
            }
        };
    }
    console.log('Final response to Alexa:', responseJson);
    res.send(200, responseJson);
    next();
}
function buildAndSendSessionEnd(req, res, next) {
    let responseJson =
        {
            "version": "1.0"
        };
    res.send(200, responseJson);
    next();
}

The Direct Line functions are the same as those we showed in Chapter 9.
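
Those helpers are thin wrappers over the Direct Line 3.0 REST endpoints. For reference, minimal sketches of them might look like the following, here using node-fetch; the Chapter 9 versions may differ in their details, and error handling is omitted.

const fetch = require('node-fetch');
const DL_BASE = 'https://directline.botframework.com/v3/directline';
function startConversation(secret) {
    // Create a new conversation; the response includes conversationId.
    return fetch(DL_BASE + '/conversations', {
        method: 'POST',
        headers: { 'Authorization': 'Bearer ' + secret }
    }).then(res => res.json());
}
function postActivity(secret, conversationId, activity) {
    // Send one activity from the user into the conversation.
    return fetch(DL_BASE + '/conversations/' + conversationId + '/activities', {
        method: 'POST',
        headers: {
            'Authorization': 'Bearer ' + secret,
            'Content-Type': 'application/json'
        },
        body: JSON.stringify(activity)
    }).then(res => res.json());
}
function getActivities(secret, conversationId, watermark) {
    // Poll for activities newer than the supplied watermark.
    const query = watermark ? '?watermark=' + watermark : '';
    return fetch(DL_BASE + '/conversations/' + conversationId + '/activities' + query, {
        headers: { 'Authorization': 'Bearer ' + secret }
    }).then(res => res.json());
}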

What happens with the message on the bot side of things? First it will hit our custom recognizer. The recognizer first ensures we are getting an Alexa message and that it is either an IntentRequest, LaunchRequest, or SessionEndedRequest request. If it is an IntentRequest, we resolve the Alexa intent and slots as the intent and entities for LUIS. As the comments note, the format of the slots object is different from the LUIS entities object. If we were to mix both NLU systems in one bot to use the same dialogs, we would have to ensure that the format is normalized. If the request is LaunchRequest or SessionEndedRequest, we simply pass through those strings as bot intents.

exports.recognizer = {
    recognize: function (context, done) {
        const msg = context.message;
        // we only look at directline messages that include additional data
        if (msg.address.channelId === 'directline' && msg.sourceEvent) {
            const alexaMessage = msg.sourceEvent.directline.alexaMessage;
            // skip if no alexaMessage
            if (alexaMessage) {
                if (alexaMessage.request.type === 'IntentRequest') {
                    // Pass IntentRequest into the dialogs.
                    // The odd thing is that the slots and entities structure is different. If we mix LUIS/Alexa
                    // it would make sense to normalize the format.
                    const alexaIntent = alexaMessage.request.intent;
                    const response = {
                        intent: alexaIntent.name,
                        entities: alexaIntent.slots,
                        score: 1.0
                    };
                    done(null, response);
                    return;
                } else if (alexaMessage.request.type === 'LaunchRequest' || alexaMessage.request.type === 'SessionEndedRequest') {
                    // LaunchRequest and SessionEndedRequest are simply passed through as intents
                    const response = {
                        intent: alexaMessage.request.type,
                        score: 1.0
                    };
                    done(null, response);
                    return;
                }
            }
        }
        done(null, { score: 0 });
    }
};

Let’s come back to the bot code. We first register our custom Alexa HTTP handler, custom recognizer, and the default response. Note our use of the custom Direct Line data. If we ask the skill something it doesn’t support, the session is terminated.

server.post('/api/alexa', (req, res, next) => {
    alexaConnector.handler(req, res, next);
});
const bot = new builder.UniversalBot(connector, [
    session => {
        let response = 'Sorry, I am not sure how to help you on this one. Please try again.';
        let msg = new builder.Message(session).text(response).speak(response).sourceEvent({
            directline: {
                keepSessionOpen: false
            }
        });
        session.send(msg);
    }
]);
bot.recognizer(alexaRecognizer);

Next, we create the QuoteDialog dialog. Note the following:
  • It reads the quote item from the entities as our Alexa skill code did.

  • It sends a response via the speak property but also includes a reprompt in the custom Direct Line channel data.

  • Within the context of this dialog, if the bot detects the AMAZON.MoreIntent, the MoreQuoteDialog dialog is invoked.

  • After the MoreQuoteDialog dialog executes, it yields control back to QuoteDialog.

bot.dialog('QuoteDialog', [
    (session, args) => {
        let quoteitem = args.intent.entities.QuoteItem.value;
        session.privateConversationData.quoteitem = quoteitem;
        let response = 'Looking up quote for ' + quoteitem;
        let reprompt = 'What else can I help you with?';
        let msg = new builder.Message(session).text(response).speak(response).sourceEvent({
            directline: {
                reprompt: reprompt,
                keepSessionOpen: true
            }
        });
        session.send(msg);
    }
])
    .triggerAction({ matches: 'QuoteIntent' })
    .beginDialogAction('moreQuoteAction', 'MoreQuoteDialog', { matches: 'AMAZON.MoreIntent' });
bot.dialog('MoreQuoteDialog', session => {
    let quoteitem = session.privateConversationData.quoteitem;
    let response = 'Getting more quote information for ' + quoteitem;
    let reprompt = 'What else can I help you with?';
    let msg = new builder.Message(session).text(response).speak(response).sourceEvent({
        directline: {
            reprompt: reprompt,
            keepSessionOpen: true
        }
    });
    session.send(msg);
    session.endDialog();
});

The same pattern is repeated for the GetAccountTypeInformationIntent intent. Lastly, we add some handlers to support things such as canceling the skill and handling the LaunchRequest and SessionEndedRequest events.
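
Before we get to those handlers, here is, for completeness, a sketch of what the mirrored account dialog might look like; it follows QuoteDialog above, and the dialog and action names are our own, illustrative choices.

bot.dialog('AccountTypeInformationDialog', [
    (session, args) => {
        let accounttype = args.intent.entities.AccountType.value;
        session.privateConversationData.accounttype = accounttype;
        let response = 'Getting information for account type ' + accounttype;
        let reprompt = 'What else can I help you with?';
        let msg = new builder.Message(session).text(response).speak(response).sourceEvent({
            directline: {
                reprompt: reprompt,
                keepSessionOpen: true
            }
        });
        session.send(msg);
    }
])
    .triggerAction({ matches: 'GetAccountTypeInformationIntent' })
    .beginDialogAction('moreAccountInfoAction', 'MoreAccountInfoDialog', { matches: 'AMAZON.MoreIntent' });
bot.dialog('MoreAccountInfoDialog', session => {
    let accounttype = session.privateConversationData.accounttype;
    let response = 'Getting more information for account type ' + accounttype;
    let reprompt = 'What else can I help you with?';
    let msg = new builder.Message(session).text(response).speak(response).sourceEvent({
        directline: {
            reprompt: reprompt,
            keepSessionOpen: true
        }
    });
    session.send(msg);
    session.endDialog();
});

With that sketch in place, the remaining handlers for canceling the skill and for the launch and session-end events follow.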

bot.dialog('CloseSession', session => {
    let response = 'Ok. Good bye.';
    let msg = new builder.Message(session).text(response).speak(response).sourceEvent({
        directline: {
            keepSessionOpen: false
        }
    });
    session.send(msg);
    session.endDialog();
}).triggerAction({ matches: 'AMAZON.CancelIntent' });
bot.dialog('EndSession', session => {
    session.endConversation();
}).triggerAction({ matches: 'SessionEndedRequest' });
bot.dialog('LaunchBot', session => {
    let response = 'Welcome to finance skill!  I can get you information about quotes or account types.';
    let msg = new builder.Message(session).text(response).speak(response).sourceEvent({
        directline: {
            keepSessionOpen: true
        }
    });
    session.send(msg);
    session.endDialog();
}).triggerAction({ matches: 'LaunchRequest' });

That completes our integration with Alexa. If we run the code, we will see similar behavior to the Lambda skill we had developed earlier. There are many unhandled intents and contingencies in both the bot code and the connector code, but we are well on our way to integrating the Alexa Skills Kit with Microsoft’s Bot Framework.

Exercise 14-2

Integrate Data and Quote Imagery into Bot Builder Code

In Exercise 14-1, we connected the Lambda function code to data and generated an image to render the quote on screen-enabled Echo devices. In this exercise, we will migrate both components into our Bot Builder code.
  1. Utilize the previous section’s code as a starting point.

  2. Extract the appropriate image generation code from the Lambda function and add it to your bot. Make sure you install the necessary Node.js packages.

  3. Generate the display template within the dialog and add it into your custom channel data. You can include the Alexa Skills Kit SDK for Node.js as a dependency to use the template builder types.

  4. Ensure the connector is translating the channel data template correctly into a final response back to Alexa.

  5. Run your integrated Alexa skill and Bot Framework bot and test it using the same methods you used in Exercise 14-1.

  6. What does it take to modify the bot code so that you can utilize your bot through the Bot Framework emulator? After all the knowledge you have gained in this book, you should be able to create a LUIS application to complete the experience.

What a great feeling getting this one working! It can be quite fun and interesting to develop voice chat bots, especially on a rich ecosystem like Alexa.

Conclusion

This chapter has enabled us to coalesce the learnings of this book to leverage Amazon’s Alexa platform and, additionally, integrate it with the Bot Builder SDK. A modern conversational interface can be reduced to NLU intents and entities plus a dialog engine to drive the conversation. Whether it is Alexa or other channels like Google Assistant, all these systems share common core concepts. There are those who will draw a strong enough distinction between voice and text communications to argue for a need for distinct ways of handling both interactions. Although it is true that voice and text communications are distinct enough to warrant different front-end experiences, the ability to handle the generic idea of a conversation is well developed in the Bot Builder SDK. The idea that we can connect different NLU systems to pass their own intents into our Bot Framework bot is powerful. It means that a message into our bot can be much more than just text. It can be any kind of complex object limited only by our imagination. Granted, there is always some level of overhead to run a generic system connected to many specific interfaces, but, as we hope to have demonstrated in this chapter, the extra effort required to build the connecting layer is well within our grasp.