Editor's Draft IN PROGRESS
Copyright © 2010, 2011 International Digital Publishing Forum™
This specification, EPUB Media Overlays 3.0, defines a usage of SMIL, OPF, CSS, and the EPUB Content Document format for representation of synchronized text and audio publications.
While it is intended that this specification be able to be reused by other applications, the illustrative examples herein are based on its use by EPUB, and certain conformance requirements specific to EPUB are normatively defined. However, such EPUB-specific conformance requirements are denoted as being applicable in the context of EPUB, and may be disregarded by other applications that wish to conform to the content and Reading System conformance requirements of this specification.
This document is meant to be read and understood in concert with the other documents that make up EPUB3. The EPUB3 Overview [EPUB3Overview], which provides an informative overview of EPUB and a roadmap to the rest of the EPUB3 documents, should be read first.
This specification relies on the SMIL 3.0 specification, from which the EPUB SMIL subset is derived.
TODO: reference SMIL 3
This specification supersedes ...
TODO: enumerate other dependent specs
EPUB Style Sheet: A CSS Style Sheet conformant to this specification.
EPUB Content Document: An XHTML Content Document or SVG Content Document as defined in EPUB_ContentDocs30.
A User Agent as defined in HTML5 that processes Content Documents and EPUB Style Sheets in a manner conformant with the EPUB specification.
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].
All sections of this specification are normative except where identified by the informative status label "This section is informative". The application of informative status to sections and appendices applies to all child content and subsections they may contain.
All examples in this specification are informative.
Media overlays are a new feature in EPUB 3. This document, the media overlays specification, defines a usage of SMIL, OPF, the EPUB Style Sheet, and the EPUB Content Document for representation of synchronized text and audio publications.
Books featuring audio narration synchronized with the text can be found today in mainstream e-book software, educational tools, and e-books formatted for persons with print disabilities. In EPUB 3, these types of books are created by adding to the EPUB fileset an overlay file describing the timing for the pre-recorded audio narration and how it relates to the text markup. The file format for the overlay itself is SMIL, a W3C standard for representing synchronized multimedia information in XML.
It is important to note that the media overlays feature is designed to be transparent and therefore will not break compatibility with text-only reading systems. It is also important to note that while future versions of this specification may incorporate support for video media (e.g. synchronized text/sign-language books), this version only supports text and audio media.
SMIL (Synchronized Multimedia Integration Language) is a W3C recommendation for describing multimedia presentations in XML. This specification defines media overlays as a subset of SMIL 3.0. The purpose of a media overlay is to define synchronization of audio clips with an EPUB Content Document.
Each phrase in the EPUB Content Document is represented by a SMIL
par element. That par element must
contain two media element children: a text element and an
audio element. The text element
represents a phrase, sentence, or other segment of the content document, and
that segment is referred to by its URI. The audio element represents an audio clip, consisting of
an audio file, given by the src attribute, and clip offsets,
given by clipBegin and clipEnd attributes.
<par>
<text src="chapter1.html#sentence1"/>
<audio src="chapter1_audio.mp3" clipBegin="23s" clipEnd="32s"/>
</par>
The ordering of the par elements shall match the default
reading order of the EPUB Content Document. An example of a basic SMIL document
describing an audio overlay for an EPUB book is shown here:
<smil>
<body>
<par id="par1">
<text src="chapter1.html#sentence1"/>
<audio src="chapter1_audio.mp3" clipBegin="0s" clipEnd="10s"/>
</par>
<par id="par2">
<text src="chapter1.html#sentence2"/>
<audio src="chapter1_audio.mp3" clipBegin="10s" clipEnd="20s"/>
</par>
<par id="par3">
<text src="chapter1.html#sentence3"/>
<audio src="chapter1_audio.mp3" clipBegin="20s" clipEnd="30s"/>
</par>
</body>
</smil>
The examples above show overlays for very simple EPUB Content Documents
that contain no nested text containers, such as sections, asides, headers,
footnotes, and so on. To represent these structures, a SMIL seq element
(sequence) must be used. Its children must be other seq elements or par
elements. Each seq element must contain an epub:textref attribute which
references, by URI, the corresponding content document element.
Here is an example of a SMIL document with nested seq elements representing a
section header and a sidebar with a nested image group:
<smil xmlns:epub="http://www.idpf.org/2007/ops">
<body>
<seq id="id1" epub:textref="chapter1.html#sectionheader">
<par id="id2">
<text src="chapter1.html#section1_title"/>
<audio src="chapter1_audio.mp3" clipBegin="0:23:23.84" clipEnd="0:23:34.221"/>
</par>
</seq>
<par id="id3">
<text src="chapter1.html#text1"/>
<audio src="chapter1_audio.mp3" clipBegin="0:23:34.221" clipEnd="0:23:59.003"/>
</par>
<par id="id4">
<text src="chapter1.html#text2"/>
<audio src="chapter1_audio.mp3" clipBegin="0:23:59.003" clipEnd="0:24:15.000"/>
</par>
<seq id="id5" epub:textref="chapter1.html#sidebar">
<par id="id6">
<text src="chapter1.html#sidebartitle"/>
<audio src="chapter1_audio.mp3" clipBegin="0:24:15.000" clipEnd="0:24:18.123"/>
</par>
<seq id="id7" epub:textref="chapter1.html#imagegroup">
<par id="id8">
<text src="chapter1.html#photo"/>
<audio src="chapter1_audio.mp3" clipBegin="0:24:18.123" clipEnd="0:24:28.764"/>
</par>
<par id="id9">
<text src="chapter1.html#photo_caption"/>
<audio src="chapter1_audio.mp3" clipBegin="0:24:28.764" clipEnd="0:24:50.010"/>
</par>
</seq>
<par id="id10">
<text src="chapter1.html#sidebartext3"/>
<audio src="chapter1_audio.mp3" clipBegin="0:24:50.010" clipEnd="0:25:28.530"/>
</par>
<par id="id11">
<text src="chapter1.html#sidebartext4"/>
<audio src="chapter1_audio.mp3" clipBegin="0:25:28.530" clipEnd="0:25:45.515"/>
</par>
</seq>
</body>
</smil>
Here is the corresponding text document for the SMIL example above:
TODO: example
SMIL text elements' src attribute values refer to content document elements
by using their IDs. The granularity level of the SMIL presentation therefore
depends on how the content document is marked up. If the finest level of
markup is at the paragraph level, then that is the finest possible level at
which the media overlay synchronization can be authored. Likewise, if
sub-paragraph markup is available, such as span elements wrapping phrases or
sentences, then a finer level of granularity is possible in the media
overlay.
Any EPUB Content Document with which a given media overlay is associated
may contain embedded media objects such as video and audio. A SMIL text
element may refer to an embedded video or audio element by its ID
value.
Visual rendering information for the EPUB Content Document element referenced
by the currently playing SMIL text element may be expressed using the CSS
pseudo class media-overlay-active. This could be a highlight, an outline, or
another indication that the element is "active".
In order to express semantics, the epub:type attribute may be present on
SMIL par and seq elements. Its values must be taken from the vocabulary defined
for the publication. This attribute facilitates intelligent decisions by the
user agent regarding playback behavior appropriate for the semantic type(s)
indicated.
TODO: example of SMIL using epub:type
head
The head element is the container for metadata in the
SMIL media overlay file.
Attributes: None.
Content model: One or more meta elements. [TODO]
meta
The meta element represents metadata for the SMIL media overlay.
Attributes: content [TODO], name [TODO], property [TODO], about [TODO]
Content model: Text
TODO: Content model is text only if property/about are used instead of content/name attrs.
body
The body element is the starting point for the
presentation contained in the SMIL media overlay file. It
represents a sequence.
Attributes:
epub:type [optional]: One or more values taken from the EPUB vocabulary. TODO: reference epub vocabulary section
id [optional]: Refer to [TODO link to attrdef-common-id]
seq
The seq element represents a SMIL playback
sequence.
Attributes:
epub:type [optional]: One or more values taken from the EPUB vocabulary. TODO: reference epub vocabulary section
id [optional]: Refer to [TODO link to attrdef-common-id]
epub:textref [required]: URI of the corresponding EPUB Content Document element. Must use a fragment identifier to refer to a specific element.
par
The par element contains media objects which are to be played in parallel.
Attributes:
epub:type [optional]: One or more values taken from the EPUB vocabulary. TODO: reference epub vocabulary section
id [optional]: Refer to [TODO link to attrdef-common-id]
Content model: text [optional], audio [required]
text
The text element represents text media by referring to an element in an EPUB Content Document file.
Attributes:
src [required]: URI with fragment identifier
Content model: Empty.
audio
The audio element represents a clip of audio media.
Attributes:
src [required]: URI of an audio file. TODO: reference audio format requirements
clipBegin [optional]: Clock value expressed either as hh:mm:ss.fraction or
as a single value followed by a unit, where the unit must be one of
h (hours), min (minutes), s (seconds), or ms (milliseconds).
Examples:
5:34:31.396
124:59:36
0:05:01.2
76.2s
3.2h
clipEnd [optional]: Clock value expressed either as hh:mm:ss.fraction or
as a single value followed by a unit, where the unit must be one of
h (hours), min (minutes), s (seconds), or ms (milliseconds).
Examples:
5:34:31.396
124:59:36
0:05:01.2
76.2s
3.2h
Must be chronologically after clipBegin.
Content model: Empty.
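A reading system has to convert these clock values into numeric offsets before it can seek within an audio file. The following is a minimal Python sketch of one way to do that, not a normative grammar: the names parse_clock_value and check_clip are hypothetical, values are returned as float seconds, and only the two forms listed above are accepted (SMIL 3.0 defines further clock-value syntaxes that this sketch rejects).

```python
import re

# Full clock value: hours:minutes:seconds with an optional fraction,
# e.g. "5:34:31.396" or "124:59:36" (hours may have more than two digits).
_FULL_CLOCK = re.compile(r"(\d+):([0-5]\d):([0-5]\d(?:\.\d+)?)")
# A single value followed by a unit, e.g. "76.2s", "3.2h", "500ms".
_TIMECOUNT = re.compile(r"(\d+(?:\.\d+)?)(h|min|s|ms)")
# Seconds per unit for the four units allowed by this specification.
_SECONDS_PER_UNIT = {"h": 3600.0, "min": 60.0, "s": 1.0, "ms": 0.001}


def parse_clock_value(value: str) -> float:
    """Convert a clipBegin/clipEnd value to a number of seconds."""
    text = value.strip()
    match = _FULL_CLOCK.fullmatch(text)
    if match:
        hours, minutes, seconds = match.groups()
        return int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    match = _TIMECOUNT.fullmatch(text)
    if match:
        number, unit = match.groups()
        return float(number) * _SECONDS_PER_UNIT[unit]
    raise ValueError(f"unsupported clock value: {value!r}")


def check_clip(clip_begin: str, clip_end: str) -> None:
    """Raise ValueError unless clipEnd is chronologically after clipBegin."""
    if parse_clock_value(clip_end) <= parse_clock_value(clip_begin):
        raise ValueError(f"clipEnd {clip_end!r} is not after clipBegin {clip_begin!r}")
```

For example, parse_clock_value("0:05:01.2") yields 301.2 seconds, and check_clip("0:23:23.84", "0:23:34.221") passes silently.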
Manifest items in the publication's Package Document may specify a media overlay
for that item via the media-overlay attribute. Media overlays are
themselves manifest items and must be referred to by their IDs. For example:
<manifest>
<item id="ch1" href="chapter1.html" media-type="application/xhtml+xml" media-overlay="ch1_audio"/>
<item id="ch1_audio" href="chapter1_audio.smil" media-type="application/smil+xml"/>
</manifest>
Manifest items which refer to SMIL media overlays must have the media-type
application/smil+xml.
While not every manifest item is required to have a media overlay associated with it, there must be a one-to-one relationship between media overlay files and manifest items; in other words, multiple manifest items cannot share a single media overlay file.
This is a forwards-compatible addition: EPUB 2.0 reading systems may safely ignore the media-overlay attribute and process documents in their normal fashion.
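The manifest rules above (media-overlay refers to another item by ID, that item has the media type application/smil+xml, and no overlay file is shared) can be checked mechanically. Below is a minimal Python sketch using the standard library's xml.etree.ElementTree; the function name overlay_hrefs is hypothetical, and it matches item elements by local name so that it accepts both the bare fragment shown above and a namespaced manifest.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Set

SMIL_MEDIA_TYPE = "application/smil+xml"


def _local_name(tag: str) -> str:
    """Strip a "{namespace}" prefix so namespaced and bare manifests both work."""
    return tag.rsplit("}", 1)[-1]


def overlay_hrefs(manifest_xml: str) -> Dict[str, str]:
    """Map each content item id to the href of its media overlay file."""
    root = ET.fromstring(manifest_xml)
    items = {el.get("id"): el for el in root.iter() if _local_name(el.tag) == "item"}
    overlays: Dict[str, str] = {}
    claimed: Set[str] = set()  # overlay ids already associated with an item
    for item_id, item in items.items():
        overlay_id = item.get("media-overlay")
        if overlay_id is None:
            continue  # not every manifest item needs an overlay
        overlay = items.get(overlay_id)
        if overlay is None:
            raise ValueError(f"{item_id}: no manifest item with id {overlay_id!r}")
        if overlay.get("media-type") != SMIL_MEDIA_TYPE:
            raise ValueError(f"{overlay_id}: media-type must be {SMIL_MEDIA_TYPE}")
        if overlay_id in claimed:
            raise ValueError(f"{overlay_id}: shared by more than one manifest item")
        claimed.add(overlay_id)
        overlays[item_id] = overlay.get("href")
    return overlays
```

Applied to the manifest example above, the sketch returns {"ch1": "chapter1_audio.smil"}.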
User agents may support media overlays, and if they do, then they must adhere to
the conformance requirements in this section. User agents that do not support media
overlays shall ignore the media-overlay attribute on manifest items and shall also
ignore the media overlay manifest items.
When the user agent loads an EPUB Content Document, it shall refer to the manifest item for that content document to see if it has a corresponding media overlay. If it does, then it looks up the SMIL media overlay by ID in the manifest and loads the SMIL file. Playback shall start either at the beginning or at a specific location within the file. When the SMIL file has finished playing, the user agent shall proceed, following the order of the spine to determine the next EPUB Content Document, and using the method described above, locate the corresponding media overlay for that next document.
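The hand-off between documents described above amounts to a walk over the spine. In this Python sketch, the spine is assumed to be a list of manifest idrefs and overlays a dictionary from item ID to overlay href; both shapes and the function name next_document are assumptions for illustration. This section does not say what should happen when the next document has no overlay, so the sketch reports None for the overlay and leaves that decision to the caller.

```python
from typing import Dict, List, Optional, Tuple


def next_document(spine: List[str], overlays: Dict[str, str],
                  current: str) -> Optional[Tuple[str, Optional[str]]]:
    """Return (next spine idref, its overlay href or None), or None at the end."""
    position = spine.index(current)  # raises ValueError if current is not in the spine
    if position + 1 == len(spine):
        return None  # the last document has finished playing
    following = spine[position + 1]
    return following, overlays.get(following)
```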
User agents must support the EPUB SMIL subset. This is a subset of SMIL 3.0
plus two attributes defined in this standard, epub:textref and epub:type.
The SMIL elements associated with synchronization behavior are called
seq (sequence) and par (parallel).
A SMIL media overlay is, in its simplest form, defined as a sequence of
parallel (i.e. rendered together) text and audio media objects. User agents
shall render immediate children of the SMIL body element in a sequence. Each
child element must be a seq or a par
element. A seq element's children must be rendered in
sequence, and playback completes when the last child has finished playing. A
par element's children must be rendered starting at
the same time, and playback completes when all children have finished
playing. When the SMIL body element's last child has
finished playing, playback of the file is done.
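The rendering rules above reduce to a depth-first walk: the children of body and seq play one after another, while the children of each par start together. The Python sketch below flattens an overlay into that order; the Step record and the playback_steps function are hypothetical names, clock values are kept as strings, and no validation is done beyond rejecting elements other than seq and par inside a sequence.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    """One par: the text to activate and the audio clip to play with it."""
    par_id: Optional[str]
    text_src: Optional[str]
    audio_src: Optional[str]
    clip_begin: Optional[str]
    clip_end: Optional[str]


def _local_name(tag: str) -> str:
    return tag.rsplit("}", 1)[-1]


def _child(element: ET.Element, name: str) -> Optional[ET.Element]:
    return next((c for c in element if _local_name(c.tag) == name), None)


def playback_steps(smil_xml: str) -> List[Step]:
    """Flatten the body of a media overlay into its sequential playback order."""
    body = _child(ET.fromstring(smil_xml), "body")
    if body is None:
        raise ValueError("no body element")
    steps: List[Step] = []

    def visit(sequence: ET.Element) -> None:
        # body and seq both render their children in sequence.
        for child in sequence:
            name = _local_name(child.tag)
            if name == "seq":
                visit(child)
            elif name == "par":
                # A par's children start at the same time, so they form one step.
                text = _child(child, "text")
                audio = _child(child, "audio")
                steps.append(Step(
                    par_id=child.get("id"),
                    text_src=text.get("src") if text is not None else None,
                    audio_src=audio.get("src") if audio is not None else None,
                    clip_begin=audio.get("clipBegin") if audio is not None else None,
                    clip_end=audio.get("clipEnd") if audio is not None else None,
                ))
            else:
                raise ValueError(f"unexpected {name!r} inside a sequence")

    visit(body)
    return steps
```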
When presented with a SMIL audio element, user agents must play the audio
file referenced by the src attribute, starting at the time given by
the clipBegin attribute and ending at the time given by the clipEnd attribute.
The following rules shall be observed:
If clipBegin is not specified, its value is assumed to be
0
If clipEnd is not specified, its value is assumed to be the
end of the physical media
If clipEnd exceeds the duration of the physical media, then
its value is assumed to be the end of the physical media
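The three rules above can be expressed as one small function. This Python sketch assumes the clock values have already been converted to seconds (None meaning the attribute was absent) and that the reading system knows the duration of the physical media; effective_clip is a hypothetical name.

```python
from typing import Optional, Tuple


def effective_clip(clip_begin: Optional[float], clip_end: Optional[float],
                   media_duration: float) -> Tuple[float, float]:
    """Apply the clipBegin/clipEnd defaulting rules; all values in seconds."""
    # A missing clipBegin means the clip starts at 0.
    begin = 0.0 if clip_begin is None else clip_begin
    if clip_end is None:
        # A missing clipEnd means the end of the physical media.
        end = media_duration
    else:
        # A clipEnd beyond the media's duration is clamped to its end.
        end = min(clip_end, media_duration)
    return begin, end
```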
User-controllable audio playback options should include timescale modification, where the playback rate is altered without distorting the pitch. The suggested range is half-speed to double speed.
When presented with a SMIL text element, user agents must ensure the EPUB
Content Document element referenced by the src attribute is visible. User
agents must apply the styling rules in the CSS pseudo class
media-overlay-active to this EPUB Content Document element.
The SMIL media overlay is closely linked to the EPUB Content Document. The
content document structure is mimicked in the SMIL file. The content
document text IDs are used in the SMIL text elements'
src attributes and the seq
elements' epub:textref attributes. This allows for the
SMIL media overlay playback to closely follow user navigation of the text,
because any text reference can be located directly in the SMIL file by its ID.
If the user pauses synchronized text/audio playback and navigates to a different part of the document, synchronized text/audio playback must resume at that point. For example, if a specific page number in the content document is the desired location, then this same point is located in the SMIL media overlay and playback is started there.
This same approach allows SMIL playback to be synchronized with the user's selection of a navigation point in the publication's global navigation (NCX). The user agent loads the media overlay for the target file and finds the correct point for starting playback based on the ID of the navigation point target.
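Because the SMIL file reuses the content document's IDs, finding where to resume is a lookup by fragment identifier. The Python sketch below assumes the target is an element that the overlay references directly (through a text element's src or a seq element's epub:textref) and that the epub: prefix is bound to the EPUB 3 namespace URI http://www.idpf.org/2007/ops; a target nested inside a referenced element, such as a page-break marker inside a paragraph, would need an extra step mapping it to its enclosing referenced element. start_par_for is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urldefrag

# Assumption: the epub: prefix is bound to the EPUB 3 namespace URI.
EPUB_TEXTREF = "{http://www.idpf.org/2007/ops}textref"


def _local_name(tag: str) -> str:
    return tag.rsplit("}", 1)[-1]


def start_par_for(smil_xml: str, doc_href: str, fragment: str) -> Optional[str]:
    """Return the id of the first par to play for the element doc_href#fragment."""
    root = ET.fromstring(smil_xml)
    for element in root.iter():  # pre-order traversal follows document order
        name = _local_name(element.tag)
        if name == "seq":
            reference = element.get(EPUB_TEXTREF)
        elif name == "par":
            text = next((c for c in element if _local_name(c.tag) == "text"), None)
            reference = text.get("src") if text is not None else None
        else:
            continue
        if reference is None:
            continue
        href, target = urldefrag(reference)
        if (href, target) != (doc_href, fragment):
            continue
        if name == "par":
            return element.get("id")
        # A matching seq starts with the first par nested inside it.
        first = next((d for d in element.iter() if _local_name(d.tag) == "par"), None)
        return first.get("id") if first is not None else None
    return None
```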
Any EPUB Content Document with which a given media overlay is associated may contain embedded media objects such as video and audio. Unlike text and images, such content types are said to be "continuous" in the sense that they contain their own timing information (i.e. audio and video clips have an intrinsic duration). Consequently, when a reading system renders the text/audio synchronization described by a media overlay, the default playback behaviors of audio and video media embedded within the associated text document must be overridden.
All audio and video media objects embedded within an EPUB Content Document must have their public playback interface deactivated (typically: play/pause control, time scrubber, volume level, etc.). This is needed to avoid interference between the scheduled playback sequence defined by the media overlay and the arbitrary playback behavior due to user interaction or script execution. This means that while the SMIL audio/text synchronization is in playback mode, the reading system must:
Hide the individual video/audio UI controls from the
page. This overrides the default behavior defined by the
controls HTML5 attribute.
Prevent scripts embedded within the EPUB document (i.e. authored as part of the default publication behavior) from invoking the JavaScript audio/video playback API. Because this may be hard to implement in practice, content producers should avoid publishing embedded scripts dedicated to controlling the playback of inline audio/video media objects, so that the published media overlay can retain full control of the synchronized text/audio presentation without any risk of interference with script-enabled custom behaviors.
All audio and video media objects embedded within an EPUB Content
Document must be initialized to their "stopped" state, and ready to
be played from the zero-position within their content stream
(possibly displaying the poster image specified using the XHTML5
markup). This overrides the default behavior defined by the
autoplay HTML5 attribute.
When a text element becomes active in the media overlay, the CSS
visual highlighting rules apply regardless of the content type
referred to by the src attribute. In other words, visible video
and audio player controls within the host EPUB Content Document must
be decorated as per the media-overlay-active CSS styling rules.
In addition to the above default behavior for SMIL activation of text fragments, audio and video playback must be started and stopped according to the duration implied by the authored SMIL synchronization (as per the standard SMIL timing model). There are two possible scenarios:
When a text element in the SMIL
markup has no audio sibling within
its par parent container, the
referenced audio or video media object must play until
it ends, at which point the SMIL text
element's lifespan terminates. In other words, the
implicit duration of the SMIL text
element (and by inference, of the parent
par container) is that of the
referenced audio or video clip.
When a text element in the SMIL
markup has an audio sibling within
its par parent container, the
playback duration of the audio or video media object
referenced by the text element must
be constrained by the duration of the
audio sibling in the SMIL media
overlay. In other words, the actual duration of the
parent par container is that of the
child audio clip, regardless of the duration of the
video or audio media pointed to by the
text element. This may cause
an embedded video or audio media object to end playback
prematurely (before reaching its full duration), or to
end before the playback of the parallel
audio in the SMIL markup is
finished (in which case the last-played video frame
should remain visible until the parent
par container finally ends). This
is equivalent to the audio element in
the SMIL markup implicitly carrying the behavior of the
endsync attribute. TODO: reference SMIL 3
endsync.
When a text element becomes inactive in the SMIL media overlay,
and it points to a video or audio media object, the referenced
media object must be reset to its initial "stopped" state, ready
to be played from the zero-position within its content stream
(possibly displaying the poster image specified using the HTML5
markup).
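The two scenarios for a text element that points to embedded audio or video determine how long the par lasts and how much of the embedded media actually plays. A Python sketch with durations in seconds; the function name par_timing and its inputs are assumptions for illustration, not part of this specification.

```python
from typing import Optional, Tuple


def par_timing(audio_clip: Optional[float],
               embedded_media: Optional[float]) -> Tuple[float, Optional[float]]:
    """Return (par duration, seconds the embedded media plays).

    audio_clip: duration of the par's audio child, or None if it has none.
    embedded_media: intrinsic duration of the audio/video element that the
    par's text element points to, or None if it points to ordinary text.
    """
    if audio_clip is None:
        if embedded_media is None:
            raise ValueError("par has neither an audio clip nor continuous media")
        # No audio sibling: the embedded media plays to its end and sets the duration.
        return embedded_media, embedded_media
    if embedded_media is None:
        return audio_clip, None  # ordinary text: the audio clip alone sets the duration
    # Audio sibling present: it governs the par. Longer media is cut short;
    # shorter media stops and holds its last frame until the par ends.
    return audio_clip, min(audio_clip, embedded_media)
```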
While reading, users may want to turn on or off certain features of the
publication, such as sidebars, footnotes, page numbers, or other types of
secondary content. This feature is called "skippability". User agents should use
the semantic information provided by SMIL elements' epub:type attribute to
determine when to offer users the option of skippable features. The decision of
whether to allow skippability or not for a given epub:type value is left to the
user agent and its familiarity with the EPUB vocabulary in use in that
publication.
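A user agent that offers skippability could filter the playback order by epub:type. In this Python sketch, the set of types the user has switched off is an input, types are inherited from enclosing seq elements, and the epub: prefix is assumed to be bound to http://www.idpf.org/2007/ops. The values "sidebar" and "footnote" in the usage example are illustrative rather than taken from a normative vocabulary, and playable_pars is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from typing import FrozenSet, List, Set

# Assumption: the epub: prefix is bound to the EPUB 3 namespace URI.
EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"


def _local_name(tag: str) -> str:
    return tag.rsplit("}", 1)[-1]


def playable_pars(smil_xml: str, skipped_types: Set[str]) -> List[str]:
    """List par ids in playback order, leaving out any par that carries, or is
    nested inside a seq that carries, an epub:type the user switched off."""
    result: List[str] = []

    def visit(element: ET.Element, inherited: FrozenSet[str]) -> None:
        # epub:type holds a whitespace-separated list of values.
        types = inherited | frozenset(element.get(EPUB_TYPE, "").split())
        if _local_name(element.tag) == "par":
            if not types & skipped_types:
                result.append(element.get("id"))
            return
        for child in element:
            visit(child, types)

    visit(ET.fromstring(smil_xml), frozenset())
    return result
```

With a sidebar seq wrapping one par, playable_pars(smil, {"sidebar"}) returns every par id except the one inside the sidebar.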
Escapable items are SMIL representations of nested structures such as tables,
lists, and sidebars that users listening to the media overlay may wish to skip
and continue reading what comes next. This is different from the skippability
described above, which enables or disables entire classes of items. In this
scenario, a user has started listening to the audio for a book, encounters a
table, and wishes to skip it. User agents should allow escaping of nested
structure items. User agents shall determine the start of nested structures by
their epub:type attribute (e.g. "table") and should offer users the
option to skip playback of that structure and resume with whatever comes after
it.
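Escaping can be implemented by climbing from the currently playing par to the nearest enclosing element whose epub:type marks it as escapable, then resuming with the first par after that element. This Python sketch treats only "table" (the example value given above) as escapable by default, assumes the epub: prefix is bound to http://www.idpf.org/2007/ops, and uses the hypothetical name escape_target; with nested structures it escapes only the innermost one.

```python
import xml.etree.ElementTree as ET
from typing import FrozenSet, Optional

# Assumption: the epub: prefix is bound to the EPUB 3 namespace URI.
EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"


def _local_name(tag: str) -> str:
    return tag.rsplit("}", 1)[-1]


def escape_target(smil_xml: str, current_par: str,
                  escapable: FrozenSet[str] = frozenset({"table"})) -> Optional[str]:
    """Return the id of the first par after the nearest escapable structure
    enclosing current_par, or None if there is nothing to escape from or
    nothing follows the structure."""
    root = ET.fromstring(smil_xml)
    parent = {child: node for node in root.iter() for child in node}
    current = next((el for el in root.iter() if el.get("id") == current_par), None)
    if current is None:
        raise ValueError(f"no element with id {current_par!r}")
    # Climb towards the root until an element with an escapable type is found.
    structure = current
    while structure is not None and not (
            escapable & set(structure.get(EPUB_TYPE, "").split())):
        structure = parent.get(structure)
    if structure is None:
        return None
    # Resume with the first par that follows the structure in document order.
    inside = set(structure.iter())
    passed = False
    for element in root.iter():
        if element is structure:
            passed = True
        elif passed and element not in inside and _local_name(element.tag) == "par":
            return element.get("id")
    return None
```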
This specification defines a subset of SMIL 3.0 and adds two attributes defined in this standard, epub:textref and epub:type.
TODO: reference EPUB media overlay schema
This appendix is informative
This specification has been developed through a cooperative effort, bringing together publishers, vendors, software developers, and experts in the relevant standards.
Version 3.0 of this specification was prepared by the International Digital Publishing Forum’s EPUB Maintenance Working Group. Active members of the working group at the time of publication of revision 3.0 were:
TODO 3.0 contributors list
Version 2.0.1 of this specification was prepared by the International Digital Publishing Forum’s EPUB Maintenance Working Group. Active members of the working group at the time of publication of revision 2.0.1 were:
| Garth Conboy eBook Technologies Working Group Vice-chair |
| George Kerscher DAISY Consortium Working Group Chair |
| Alexis Wiles OverDrive |
| Alicia Wise Publishers Licensing Society |
| Amir Moghaddami National Library and Archives of Iran |
| Andreas Gosling Penguin UK |
| Andy Williams Cambridge University Press |
| Anupam Jain Innodata Isogen |
| Ben Trafford Invited Expert |
| Benoit Larroque Feedbooks |
| Bill McCoy Invited Expert |
| Bill Kasdorf Apex CoVantage |
| Bob Kasher The BookMasters Group |
| Brady Duga eBook Technologies |
| Byron Laws PreMedia Global |
| Catherine Zekri University of Montreal |
| Chris Kennedy Pearson Education |
| Corey Podolsky Entourage Systems Inc. |
| Cristina Mussinelli AIE |
| Daihei Shiohama Voyager Japan |
| Dan Amos DNAML |
| Dan Galperin Kobo |
| Dan Kok Crossway Books and Bibles |
| Dave Cramer Hachette Book Group USA |
| Dave Gunn RNIB Centre for Accessible Information |
| David Mandelbaum Barnes&Noble.com |
| Deidra Roberts World Health Organization |
| Donald Goyette McGraw-Hill Professional |
| Eric Freese Aptara |
| Eric Gold Digital Divide Data |
| Eric Muller Adobe |
| Gregory Shepherd Cengage Learning |
| Guy Fain Crossway Books & Bibles |
| Hadrien Gardeur Feedbooks |
| Hisashi Hoda Voyager Japan |
| Ignacio Fernández Galván |
| Israel Viente Mendele He-Books |
| Jim Link Macmillan Publishing Solutions |
| James MacFarlane Easypress Technologies |
| Jim Rura Educational Testing Service |
| John Crossman Benetech |
| John Prabhu HOV Services |
| John Rivlin eBook Technologies |
| John Wait Pearson Education |
| Jon Noring Invited Expert |
| Joshua Tallent eBook Architects |
| Karen Broome Sony |
| Keith Fahlgren Threepress Consulting |
| Kenny Johar Vision Australia |
| Laurie Casey Pearson |
| Lech Rzedzicki Pearson UK |
| Liisa McCloy-Kelley Random House |
| Lindy Humphreys Wiley/ Blackwell Books |
| Liza Daly Threepress Consulting |
| Makoto Murata JEPA EPUB Study Group |
| Marco Croella Simplicissimus Book Farm |
| Markus Gylling DAISY Consortium |
| Mattias Karlsson Dolphin Computer Access AB |
| Michael Smith IDPF |
| Neil Soiffer Design Science |
| Noah Genner BookNet Canada |
| Pat Pagano HarperCollins |
| Patricia Payton Bowker |
| Patrick Barry The Educational Company of Ireland |
| Patrick Berube LEARN |
| Paul Durrant Durrant Software Limited |
| Paul Norton Invited Expert |
| Penelope Reid EPUB User Group (UK) |
| Perce Huang Far EasTone Telecommunications |
| Peter Brantley Internet Archive |
| Peter Sorotokin Adobe |
| Richard Heiberger HarperCollins Publishers |
| Richard Kwan Invited Expert |
| Russell White Random House |
| Samir Kakar Aptara |
| Satya Pamarty codeMantra |
| Scott Cook codeMantra |
| Sean Ramsey LibreDigital |
| Siobahn Padgett Hachette BG USA |
| Steve Arany John Wiley & Sons |
| Takeshi Kanai Sony |
| Thad Swiderski LibreDigital |
| Tim Middleton BookNet Canada |
| Trudy Conti Follett |
| Tyler Ruse LibreDigital |
| William Howard EasyPress Technologies |
Version 1.0 of this specification was prepared by the International Digital Publishing Forum’s Unified OEBPS Container Format Working Group. Active members of the working group at the time of publication of revision 1.0 were:
| Garth Conboy eBook Technologies Working Group Co-Chair |
| John Rivlin eBook Technologies Working Group Co-Chair |
| Jon Ferraiolo IBM Working Group Vice-Chair |
| Nick Bogaty IDPF Working Group Secretary |
| Kelley L. Allen Random House |
| Angel Ancin iRex Technologies |
| Ryan Bandy Random House |
| Richard Bellaver Ball State University |
| Thierry Brethes Mobipocket |
| Janice Carter Benetech/Bookshare.org |
| Richard Cohn Adobe Systems Inc. |
| Neil De Young Hachette Book Group USA |
| Linh N. Do Random House, Inc. |
| Geoff Freed WGBH |
| Liang Gang TriWorks Asia |
| Peter Ghali Motricity, ereader.com |
| Markku T. Hakkinen DAISY Consortium |
| Gillian Harrison NetLibrary |
| Jonathan Hevenstone Publishing Dimensions |
| Theresa Horner HarperCollins |
| Karen Iannone Houghton Mifflin |
| Claire Israel Simon & Schuster |
| Mattias Karlsson Dolphin Computer Access |
| Bill Kasdorf Apex Publishing |
| George Kerscher DAISY Consortium |
| Steve Kotrch Simon & Schuster |
| Bill McCoy Adobe Systems, Inc. |
| Bill McKenna Follett |
| Bonnie Melton Houghton Mifflin College Division |
| Jon Noring OpenReader Consortium Invited Expert |
| Sayu Osayande Motricity, ereader.com |
| Lee Passey Invited Expert |
| Steve Potash OverDrive |
| Tyler Ruse Codemantra |
| Mike Smith Harlequin |
| Kimi Sugeno John Wiley & Sons |
| Gary Varnell Osoft.com |
| Xin Wang, Ph.D. ContentGuard, Inc. |
| Andrew Weinstein Lightning Source |
| Tom Whitcomb NetLibrary |
| Andy Williams Cambridge University Press |
| Eli Willner Green Point Technology Services |
[EPUB3Overview] EPUB3 Overview. TODO: fix link.
[RFC2119] Key words for use in RFCs to Indicate Requirement Levels (RFC 2119) . March 1997.