The goal is to provide useful computer magic when publishing a document,
losing as little of the richness of the process of producing the document as the author chooses,
supporting active reading and analysis of the document by others, and
making the document semantically augmented when on a server, so that a user can perform operations such as searches and analysis across documents in aggregate.
Functionalities / Requirements
The document interchange system we are designing and, hopefully, building, should support:
• Retention of original document attributes (such as coordinates of nodes in a space), so that a published document can be reopened in the original application.
• Extraction of attributes to allow other applications to represent and use specialised attributes.
• Annotations that the user can add by whatever means they prefer, such as underlines, highlights or drawings. These annotations should then be attached to the metadata of the document so that the user can, for example, choose to search only highlighted text.
• High Resolution Addressing should be possible so that the author can cite specific passages of text.
• Distributed Publishing so that if the original server link does not work, the software can present the copies of the document.
• A new form of Glossaries, which could be powerful in giving the reader a clearer understanding of the author's intention than what the author explicitly puts into the document.
• Server Knowledge of the content of the document to allow for analysis of the document or documents in bulk, through making the data in the document clearly tagged and surfacing this metadata to other applications or servers.
• Legacy access to allow users without special software to read the basic document.
The big aim is to produce a document reading, writing and publishing system which will let the reader have a richer interaction with the author’s work than interacting with the author himself or herself would allow. Because this will require a new perspective on authoring (it won't be simply writing in the old way), we are calling this Socratic Authorship.
PDF as a Rich Container or HTML Encoding
This needs to be possible within legacy systems: publishing a document should keep its data structured, for better use when someone reads the document or interacts with it, either as a whole or in pieces. Our solution is a process of encapsulation, where the original document and an XML version are embedded inside a .pdf document. If a reader only has a PDF reader, the PDF version will be shown; if the user has the original system, the original document will be presented; and if the reader has software which understands the XML, that version will be shown.
If we use HTML we will need to semantically link or embed rich media. Christopher Gutteridge has already demonstrated this while working on this project, with an HTML page which encodes an image without reference to external media (though he made a rather poor choice as to what image to encode): http://lemur.ecs.soton.ac.uk/~cjg/frode.html
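One common way to embed an image in an HTML page without referring to external media is a base64 `data:` URI. The sketch below assumes that approach; whether the demonstration page above uses exactly this technique is not stated here. The function names and the placeholder image bytes are illustrative only.

```python
import base64
import html


def image_to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Return a data: URI carrying the image bytes inline as base64."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def self_contained_page(title: str, image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Build an HTML page whose image is embedded rather than linked."""
    safe_title = html.escape(title)
    uri = image_to_data_uri(image_bytes, mime_type)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{safe_title}</title></head>\n'
        f'<body><img alt="{safe_title}" src="{uri}"></body></html>\n'
    )


# Placeholder bytes standing in for a real image file (an assumption for the
# demo); in practice these would be read from the image on disk.
placeholder_image = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
page = self_contained_page("Embedded figure", placeholder_image)
```

The resulting page can be saved and opened anywhere, since the image travels inside the HTML itself. The cost is size: base64 makes the payload roughly a third larger.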
Christopher Gutteridge continues:
The PDF format, internally, is just a list of objects. It is possible for these to include ones that are understood by extensions to the normal format, without preventing the file from working as normal.
Our idea is to define (or co-opt?) a mechanism for extra document data. This mechanism should consist of self-contained data-packages that could be re-used in systems other than PDF. These extended packages of data would allow more meaning to be preserved from the authoring tool through to the end use without requiring systems to understand new file formats. PDF is the de facto standard for academic documents (sadly).
Obviously they should only contain information the author wished to share.
These additional objects inside the PDF file could be
- created by the authoring tool when the PDF is created,
- or they could be attached afterwards by a separate tool.
The second method is clunky but useful for bootstrapping the idea and testing/demoing.
These objects can be used in three (or more) ways:
- using a PDF viewer that understands them (expensive to do for a demo)
- detecting them in online archive software and making them available to the viewer as standalone information, either rendered into web-content or as a simple download.
- detecting them with crawlers and similar software to create complex search tools that can understand the semantic information in the document.
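As a rough illustration of the detection step in the last two uses, the sketch below scans raw PDF bytes for embedded file markers. It assumes the packages are stored as ordinary PDF embedded files (`/Type /Filespec` dictionaries and `/Type /EmbeddedFile` streams), which is one possible mechanism and not a decision made here. It is deliberately crude: it will miss objects held in compressed object streams, names written as hex strings, and names containing parentheses. The sample bytes are a hand-written fragment, not a complete PDF.

```python
import re

# An embedded file stream carries /Type /EmbeddedFile in its dictionary.
EMBEDDED_STREAM = re.compile(rb"/Type\s*/EmbeddedFile\b")
# A file specification dictionary names the file with /F (name).
FILESPEC_NAME = re.compile(rb"/Type\s*/Filespec\b[^>]*?/F\s*\(([^)]*)\)")


def find_embedded_packages(pdf_bytes: bytes) -> dict:
    """Report file names and the number of embedded streams seen in the bytes."""
    names = [m.group(1).decode("latin-1") for m in FILESPEC_NAME.finditer(pdf_bytes)]
    stream_count = len(EMBEDDED_STREAM.findall(pdf_bytes))
    return {"names": names, "stream_count": stream_count}


# Hand-written fragment for illustration only (no xref table or trailer).
sample = (
    b"%PDF-1.7\n"
    b"7 0 obj\n<< /Type /Filespec /F (rivers.csv) /EF << /F 8 0 R >> >>\nendobj\n"
    b"8 0 obj\n<< /Type /EmbeddedFile /Subtype /text#2Fcsv /Length 17 >>\n"
    b"stream\nRIVER,2017-01-01\nendstream\nendobj\n"
)

found = find_embedded_packages(sample)
```

A real archive plugin or crawler would more likely use a PDF library to walk the object tree; the byte scan only shows that discovery does not need a full viewer.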
The encoding inside such a package isn't set in stone, nor need it be. Whatever we do first will be wrong in some way.
Rather than make lots of specialised dataset types, I would rather start with some very simple formats that are easy to work with, then add more semantic meaning with additional datasets.
Data too large to embed could be linked by reference to external files using a URL, DOI or other mechanism.
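To make the idea of a simple, self-contained package more concrete, here is a minimal sketch of one that either carries its data inline or points to it by reference. Every field name (`name`, `media_type`, `encoding`, `data`, `sha256`, `reference`) and the example.org URL are assumptions made for illustration; no format has been agreed.

```python
import base64
import hashlib
import json


def make_package(name, media_type, data=None, reference=None):
    """Build one package, holding its data either inline or by reference."""
    if (data is None) == (reference is None):
        raise ValueError("give exactly one of data or reference")
    package = {"name": name, "media_type": media_type}
    if data is not None:
        package["encoding"] = "base64"
        package["data"] = base64.b64encode(data).decode("ascii")
        package["sha256"] = hashlib.sha256(data).hexdigest()
    else:
        package["reference"] = reference
    return package


def read_inline(package):
    """Return the raw bytes of an inline package, or None for a reference."""
    if "data" not in package:
        return None
    return base64.b64decode(package["data"])


# A small dataset is embedded; a large one is only pointed to (hypothetical URL).
small = make_package("rivers.csv", "text/csv", data=b"RIVER,2017-01-01\n")
large = make_package("rivers-full.csv", "text/csv",
                     reference="https://example.org/rivers-full.csv")
serialised = json.dumps([small, large], indent=2)
```

Because the package is plain JSON, the same bytes could be attached to a PDF, placed in an HTML page, or stored beside a video file without change.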
I'm going to use, as an example, an academic paper containing a graph showing the height of 5 different rivers as a line graph, plotting time vs height above river average.
In addition to the normal PDF content, our file contains some extra objects:
- a simple CSV file with the headings: RIVER, 2017-01-01, 2017-01-08, 2017-01-15, 2017-01-22 ... (and so on). There are five more rows, one for each river.
- an object which links the CSV file with the graph in the visible document (with enough extra information that a smart PDF viewer could re-render the graph as interactive)
- an object which explains the semantics of the CSV file, linking each river name to some global ID for that river, documenting the error rate, and stating that the values in each cell are the maximum value in the week starting at the given date in EST (because midnight matters). It says the units are length (cm).
This lets a very simple discovery tool say "hey, I found me a CSV" and allow the reader to view or work with the information.
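The river example could look roughly like the sketch below: a CSV object, a separate semantics object, and a small loader that combines them. The river names, all numeric values, the error figure, the global IDs (example.org URLs) and every key in `SEMANTICS` are invented for illustration, and only the four date columns named above are included.

```python
import csv
import io

# Invented values: weekly maximum height above each river's average.
CSV_TEXT = """\
RIVER,2017-01-01,2017-01-08,2017-01-15,2017-01-22
River A,12,30,-4,7
River B,-3,5,18,2
River C,0,-6,9,14
River D,21,11,-2,-8
River E,4,4,25,-1
"""

# Hypothetical semantics object describing how to read the CSV.
SEMANTICS = {
    "describes": "rivers.csv",
    "row_key": "RIVER",
    "quantity": "height above river average",
    "unit": "cm",
    "cell_statistic": "maximum over the week starting at the column date",
    "timezone": "EST",
    "error": {"type": "absolute", "value": 2, "unit": "cm"},
    "row_ids": {
        f"River {letter}": f"https://example.org/id/river/{letter.lower()}"
        for letter in "ABCDE"
    },
}


def load_series(csv_text, semantics):
    """Return {global river ID: {week start date: height}} from the CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    series = {}
    for row in reader:
        name = row.pop(semantics["row_key"])
        river_id = semantics["row_ids"].get(name, name)
        series[river_id] = {date: float(value) for date, value in row.items()}
    return series


series = load_series(CSV_TEXT, SEMANTICS)
```

A tool that knows nothing about the semantics object can still offer the CSV as a download, while one that does can label axes, show units and resolve each row to a globally identified river.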
What is achievable in a year:
- design a v0.1 format for this data (or co-opt any existing work)
- write a tool to attach it to a PDF
- add a plugin to a research repository that can render the graph on the page for the document, e.g. a page like this: https://eprints.soton.ac.uk/270885/ - it could also easily allow the CSV to be downloaded.
What would be cool but more work:
- invent more useful datasets to embed in a document
- make a desktop PDF viewer aware of some of these datasets
- make a (web based?) tool which can pull semantically marked-up data out of several papers and combine it using what it's been told about error factors, units etc.
- embed the same datasets into other container-style formats. Video? (mp4, avi etc are just containers, I think). HTML should be doable.
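The tool that combines data from several papers depends on each dataset stating its units and error. A minimal sketch of that step, assuming each paper yields a dictionary with hypothetical `source`, `unit`, `error` and `values` fields, and with all numbers invented:

```python
# Conversion factors to centimetres for the length units this sketch knows.
TO_CM = {"mm": 0.1, "cm": 1.0, "m": 100.0}


def normalise(dataset):
    """Return a copy of the dataset with values and error expressed in cm."""
    factor = TO_CM[dataset["unit"]]
    return {
        "source": dataset["source"],
        "unit": "cm",
        "error": dataset["error"] * factor,
        "values": {key: value * factor for key, value in dataset["values"].items()},
    }


def combine(datasets):
    """Merge datasets from several papers into one list of comparable rows."""
    rows = []
    for dataset in map(normalise, datasets):
        for (river, date), value in sorted(dataset["values"].items()):
            rows.append({
                "river": river,
                "date": date,
                "height_cm": value,
                "error_cm": dataset["error"],
                "source": dataset["source"],
            })
    return rows


# Two invented datasets, one in centimetres and one in metres.
paper_a = {"source": "paper-a", "unit": "cm", "error": 2.0,
           "values": {("river-a", "2017-01-01"): 12.0}}
paper_b = {"source": "paper-b", "unit": "m", "error": 0.05,
           "values": {("river-a", "2017-01-08"): 0.25}}

rows = combine([paper_a, paper_b])
```

Each row keeps its source and its error, so a later analysis can still weigh measurements from different papers differently instead of treating them as equally reliable.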