This page describes platform-specific JSSCxml extensions for speech interaction in supported Web browsers. These extensions are not guaranteed to interoperate with other SCXML implementations. Out of the box, JSSCxml currently supports speech synthesis only.

The `<speak>` element

The <speak> element can be used wherever executable content is allowed. It connects SCXML to the Web Speech Synthesis API, recently implemented in several Web browsers, and exposes most of its functionality.

Datamodel fields

In the datamodel, _x.voices is the list of voices supported by the platform, as returned by the SpeechSynthesis.getVoices() method. Items of that list are not voiceURIs, but the full SpeechSynthesisVoice objects with a voiceURI property.
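
Because each item is a full voice object, a script can pick one by inspecting its properties. The following is a minimal sketch, assuming the items expose the `voiceURI`, `name` and `lang` properties of SpeechSynthesisVoice; the helper `findVoiceByLang` and the sample data are hypothetical and not part of JSSCxml.

```javascript
// Hypothetical helper, not part of JSSCxml: look up a voice in a list shaped
// like _x.voices (objects with at least `voiceURI`, `name` and `lang`).
function findVoiceByLang(voices, lang) {
  const wanted = lang.toLowerCase();
  // Prefer an exact match on the language tag, e.g. "en-US".
  const exact = voices.find(v => v.lang.toLowerCase() === wanted);
  if (exact) return exact;
  // Otherwise accept any regional variant of the same primary language.
  const primary = wanted.split("-")[0];
  const variant = voices.find(v => v.lang.toLowerCase().split("-")[0] === primary);
  return variant || null;
}

// Stand-in data; in a browser the list would come from speechSynthesis.getVoices().
const sampleVoices = [
  { voiceURI: "urn:example:fr", name: "Example French", lang: "fr-FR" },
  { voiceURI: "urn:example:en-gb", name: "Example British", lang: "en-GB" },
];

const picked = findVoiceByLang(sampleVoices, "en-US");
console.log(picked ? picked.name : "no matching voice");
```

Inside an SCXML document, the same lookup would be applied to `_x.voices` and the returned object, not its voiceURI, would be handed to the voice attribute.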

Namespace

The namespace for <speak> must be "http://www.jsscxml.org", for which I suggest the shorthand "jssc". Thus:

<scxml xmlns="http://www.w3.org/2005/07/scxml" xmlns:jssc="http://www.jsscxml.org">
…
<jssc:speak text="Hello world!" xml:lang="en-US"… />
…

JSSCxml does not use the SSML namespace because the element carries some new attributes, and because inline SSML content, while good-looking, would be completely inflexible. Instead, SSML documents can be created (or parsed from actual SSML content in a <data> element), manipulated by ECMAScript code, and finally passed to the <speak> element.

Attribute detail

Name	Required	Default value	Valid values	Description
text	yes*; text and expr are mutually exclusive	none	text with optional SSML tags	The text or SSML that will be spoken
expr	yes*; text and expr are mutually exclusive	none	an expression evaluating to a string or an SSML Document	Evaluated when the `<speak>` element is executed; used as if there had been a `text` attribute with the resulting (linearized) value
xml:lang	no; xml:lang and langexpr are mutually exclusive	platform-specific	any RFC 3066 language code supported by the platform	The language of the text to be read
langexpr	no; xml:lang and langexpr are mutually exclusive	none		Evaluated when the `<speak>` element is executed; used as if there had been an `xml:lang` attribute with the resulting value
voice	no	platform-specific	expression that must evaluate to a member of _x.voices	The voice used to read the text
volume	no	1	0 – 1	How loud the text will be spoken
rate	no	1	0.1 – 10	How fast the text will be spoken
pitch	no	1	0 – 2	Pitch modifier for the synthesized voice
interrupt / nomore	no*	false	boolean	Stops speaking and cancels queued utterances

* The boolean nomore (or interrupt) may appear alone in the <speak> tag, in which case neither text nor expr is needed.

Note that the DOMParser API used by JSSCxml to parse all this may reject SCXML documents where boolean attributes are written without a value. It is therefore advised to give them the value "true" (although the interpreter will be happy with any value at all, as long as it gets well-formed XML).
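
The numeric ranges in the table can be checked before a value reaches the synthesizer. This is only a sketch: the ranges and defaults are copied from the table, but `normalizeSpeakParams` is a hypothetical helper, and whether JSSCxml itself rejects or clamps out-of-range values is not stated here.

```javascript
// Ranges and defaults copied from the attribute table; the function itself is
// a hypothetical helper and not part of JSSCxml.
const SPEAK_RANGES = {
  volume: { min: 0, max: 1, fallback: 1 },
  rate: { min: 0.1, max: 10, fallback: 1 },
  pitch: { min: 0, max: 2, fallback: 1 },
};

// Takes raw attribute values (strings or numbers) and returns numbers,
// substituting the default for anything that was left out.
function normalizeSpeakParams(attrs) {
  const result = {};
  for (const [name, range] of Object.entries(SPEAK_RANGES)) {
    const raw = attrs[name];
    if (raw === undefined || raw === null || raw === "") {
      result[name] = range.fallback;
      continue;
    }
    const value = Number(raw);
    if (Number.isNaN(value) || value < range.min || value > range.max) {
      throw new RangeError(
        `${name} must be between ${range.min} and ${range.max}, got "${raw}"`
      );
    }
    result[name] = value;
  }
  return result;
}

console.log(normalizeSpeakParams({ rate: "1.5" }));
```

Throwing on a bad value is a design choice made for this example; silently clamping to the nearest bound would be an equally plausible policy.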

Children

None.

Behavior

When executed, the <speak> element causes its text or SSML content to be read by the platform's SpeechSynthesis implementation, using the supplied parameters. When the synthesizer has reached the end of the utterance, a speak.end event will be placed in the external queue, with its data field containing a reference to the underlying SpeechSynthesisUtterance object.

If the nomore or interrupt attribute is present, current and queued utterances will be cancelled first, so the new utterance (if supplied) will be spoken immediately, no matter what. The interpreter will place a speak.error event in the external queue for each cancelled utterance.
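
The order of operations described above can be sketched as follows. This is an illustration under stated assumptions, not JSSCxml's actual code: `executeSpeak` is a hypothetical name, `synth` stands for `window.speechSynthesis`, `Utterance` for the `SpeechSynthesisUtterance` constructor, and the external queue is modelled as a plain array. The event objects are simplified to a name and the utterance they concern.

```javascript
// Sketch only. `synth` needs cancel() and speak(), as window.speechSynthesis
// provides; `Utterance` is a constructor taking the text to speak.
function executeSpeak(synth, Utterance, queue, options) {
  if (options.interrupt) {
    // Cancel the current and queued utterances before anything else, so the
    // new utterance (if any) is spoken immediately.
    synth.cancel();
  }
  if (options.text === undefined) {
    return null; // interrupt/nomore used alone: nothing new to say
  }
  const utterance = new Utterance(options.text);
  // Report the outcome to the state machine through the external queue.
  utterance.onend = () => queue.push({ name: "speak.end", origin: utterance });
  utterance.onerror = () => queue.push({ name: "speak.error", origin: utterance });
  synth.speak(utterance);
  return utterance;
}
```

In a browser, a call might look like `executeSpeak(window.speechSynthesis, SpeechSynthesisUtterance, [], { text: "Hello world!", interrupt: true })`. Passing the synthesizer and constructor as arguments keeps the function usable with stand-in objects outside a browser.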

At this time, Chrome and Safari's implementations disagree on the way to select a voice. Chrome's utterance objects have a voiceURI property which can be set to the voiceURI value of a voice, whereas Safari's utterance objects have a voice property which accepts only references to whole SpeechSynthesisVoice objects. In order to hide this misbehavior from authors, the voice attribute defined here always takes a reference, and JSSCxml will ensure that each browser gets what it expects.
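
One way to hide that difference is feature detection on the utterance object. The sketch below assumes only what the paragraph above states (some utterances expose `voiceURI`, others expose `voice`); the helper name `applyVoice` is hypothetical, and JSSCxml's real code may differ.

```javascript
// Hypothetical helper: `voice` is a whole SpeechSynthesisVoice-like object,
// as found in _x.voices. Each branch runs only if the property exists.
function applyVoice(utterance, voice) {
  if ("voice" in utterance) {
    // Implementations that expect a reference to the voice object itself.
    utterance.voice = voice;
  }
  if ("voiceURI" in utterance) {
    // Implementations that expect the voice's URI string.
    utterance.voiceURI = voice.voiceURI;
  }
  return utterance;
}
```

Because the author always supplies the whole object, both forms can be derived from it, which is why the voice attribute takes a reference rather than a URI.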

If no voice is specified, the xml:lang attribute will cause the platform to choose the default voice for that language if one is available or, failing that, a voice for another regional variant of that language. The language defined by xml:lang higher in the document hierarchy (typically on the root element) is inherited by <speak> elements, so there is no need to repeat it on each one.
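
The inheritance rule amounts to walking up the tree until an xml:lang attribute is found. A minimal sketch, assuming nodes that expose `getAttributeNS()` and `parentNode` as DOM elements do; `inheritedLang` is a hypothetical name, not a JSSCxml function.

```javascript
// Namespace that the reserved "xml" prefix is bound to.
const XML_NS = "http://www.w3.org/XML/1998/namespace";

// Returns the nearest xml:lang value on `element` or one of its ancestors,
// or null if none is declared (the platform default then applies).
function inheritedLang(element) {
  for (let node = element; node; node = node.parentNode) {
    if (typeof node.getAttributeNS !== "function") {
      break; // reached something that is not an element, e.g. the Document
    }
    const lang = node.getAttributeNS(XML_NS, "lang");
    if (lang) return lang;
  }
  return null;
}
```

With xml:lang="en-US" on the root element, a <speak> element nested anywhere below it resolves to "en-US" unless it or a closer ancestor declares another language.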

`speak` events

speak.* events queued when using speech synthesis have the DOM Event origintype, but their origin is the corresponding SpeechSynthesisUtterance object rather than a node. There is no reason to <send> any event back to those objects (and the interpreter won't take them as a valid target anyway), but their text property allows you to track which utterance it is that has started, ended, or been cancelled.

The event's data will contain the elapsedTime, charIndex, and name properties of the original DOM event instead of a copy of the event itself, as would be the case for DOM events converted in the usual way by JSSCxml.
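
The shape of such an event can be pictured with the sketch below. The function `toSpeakEvent` is hypothetical, and deriving the event name by prefixing the DOM event type with "speak." is an assumption that matches speak.end and speak.error but is not confirmed for other event types.

```javascript
// Hypothetical conversion routine, not JSSCxml's actual code.
function toSpeakEvent(domEvent) {
  return {
    // Assumed naming scheme: DOM "end" becomes "speak.end", and so on.
    name: "speak." + domEvent.type,
    // The utterance the DOM event was fired on, rather than a node.
    origin: domEvent.target,
    // Only the three properties named in the text, not the event itself.
    data: {
      elapsedTime: domEvent.elapsedTime,
      charIndex: domEvent.charIndex,
      name: domEvent.name,
    },
  };
}
```

A transition handling the event could then compare the origin's text property against the text it asked to have spoken, in order to tell utterances apart.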
