General

What is the SpeexX Ocean annotator API?

It is simply an API for annotating plain text documents with semantic information. To do this, you annotate a range of characters with a so-called Annotation, which is defined by the range of characters, a namespace, a local name, and a set of attributes.
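
To make the four parts of an Annotation concrete, here is a minimal sketch in Java. The class AnnotationSketch, its fields, the exclusive end offset, and the example namespace http://example.org/geo# are assumptions made for illustration; they are not the actual types or conventions of the Ocean API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration only; not the Ocean API's Annotation type.
final class AnnotationSketch {
    final int start;                      // first annotated character (inclusive)
    final int end;                        // end of the range (exclusive, by assumption)
    final String namespace;               // namespace URI
    final String localName;               // name within that namespace
    final Map<String, String> attributes; // additional key/value information

    AnnotationSketch(int start, int end, String namespace, String localName,
                     Map<String, String> attributes) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid range: " + start + "-" + end);
        }
        this.start = start;
        this.end = end;
        this.namespace = namespace;
        this.localName = localName;
        // Defensive copy so later changes to the caller's map do not leak in.
        this.attributes = Collections.unmodifiableMap(new HashMap<>(attributes));
    }

    // Returns the characters of the document that this annotation covers.
    String coveredText(String document) {
        return document.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "Berlin is the capital of Germany.";
        Map<String, String> attrs = new HashMap<>();
        attrs.put("country", "DE");
        AnnotationSketch city =
            new AnnotationSketch(0, 6, "http://example.org/geo#", "city", attrs);
        System.out.println(city.localName + ": " + city.coveredText(text));
    }
}
```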

Is it possible to annotate PDF, Word or other document files?

No, only plain text documents can be annotated. To annotate documents in well-documented or proprietary formats, you must extract the text part of the document and create a new document to annotate with the CoreBuilder API.

Why use namespaces and local names for annotations?

The combination of namespaces (as URIs) and local names gives a large scope for individual annotations. There is no need to define a central repository of annotations; each domain can define its own annotations.

I want to annotate something, but I'm not sure that the annotation is correct. What can I do?

Use the weight attribute of an Annotation. The value of this attribute is a signed 32-bit integer, but it is recommended to keep the weight in the range from 0 (lowest weight) to 100 (highest weight). The default is 100.
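
As a sketch of how a caller might stay inside the recommended range, the following Java helper maps a confidence value to a weight. The class WeightSketch and its methods are hypothetical helpers on the caller's side; only the range 0 to 100 and the default of 100 come from the description above.

```java
// Hypothetical caller-side helper; the Ocean API itself accepts any 32-bit integer.
final class WeightSketch {
    static final int MIN_WEIGHT = 0;       // lowest recommended weight
    static final int MAX_WEIGHT = 100;     // highest recommended weight
    static final int DEFAULT_WEIGHT = 100; // used when nothing else is known

    // Forces an arbitrary integer into the recommended range 0..100.
    static int clamp(int weight) {
        return Math.max(MIN_WEIGHT, Math.min(MAX_WEIGHT, weight));
    }

    // Maps a confidence between 0.0 and 1.0 to a weight between 0 and 100.
    static int fromConfidence(double confidence) {
        return clamp((int) Math.round(confidence * 100.0));
    }

    public static void main(String[] args) {
        System.out.println(fromConfidence(0.75)); // an uncertain annotation
        System.out.println(DEFAULT_WEIGHT);       // a certain annotation
    }
}
```

An annotation you are unsure about would then get a weight below 100, and a consumer can ignore everything under a threshold of its own choosing.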

Default implementation

Is there a mimetype and file extension defined?

Yes, you will find the definitions in Mimetype and file extension.

Why does the default implementation use XML to serialize annotated documents?

XML is not a very good format for serializing annotated documents, because it has no built-in support for overlapping elements. However, XML is a free format, and overlapping is easy to emulate.

Why not use RDF for semantic annotation serialization?

It is possible to use RDF for semantic information, but it is not useful here, because RDF is a framework for adding semantic information outside the document. It is not very useful to use RDF inside a document.

Additionally, RDF is very heavyweight for this problem. It is not easy to store all attributes and position information in an RDF graph. And if it is done, the related document is not very robust: if you add or remove a character in the annotated text, the position information becomes invalid.

Finally, RDF inflates the documents considerably (much more than the default implementation).
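
The robustness problem mentioned above (positions stored outside the text become invalid after an edit) can be shown with a few lines of Java. The class StaleOffsetSketch and the sample text are made up for this illustration and have nothing to do with RDF tooling or the Ocean API.

```java
// Illustration only: offsets stored separately from the text go stale after an edit.
final class StaleOffsetSketch {
    // Reads the characters a stored range points at in the given text.
    static String textAt(String document, int start, int end) {
        return document.substring(start, end);
    }

    public static void main(String[] args) {
        String original = "Hello world";
        int start = 6; // stored externally, pointing at "world"
        int end = 11;

        // One character is added at the beginning of the text.
        String edited = "!" + original;

        System.out.println(textAt(original, start, end)); // the intended word
        System.out.println(textAt(edited, start, end));   // shifted by one character
    }
}
```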

How is overlapping done?

Overlapping is emulated with the a:id attribute and its counterpart, the a:ref attribute. All annotations have an internal ID. If annotations overlap, the XML elements are split into id and ref parts.

Example:

There are two annotations, one in the range 2 to 6 and the second in the range 4 to 8. The first annotation gets the ID a1. The second annotation gets the ID a2. In the XML, the element of the first annotation starts at position 2 with the attribute a:id='a1'. At position 4 the second annotation starts with the attribute a:id='a2'. At position 6 all elements are closed. After closing, the second annotation is reopened with the attribute a:ref='a2'. The deserializer rebuilds the annotation positions with the help of this data.
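
The rebuilding step can be sketched in Java as follows. This is not the code of the default deserializer: the class OverlapSketch, the Part type, and the rule that an a:ref part must follow an a:id part with the same ID are assumptions derived from the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of merging id and ref parts back into one range per annotation.
final class OverlapSketch {
    // One XML element as it appears in the serialized document.
    static final class Part {
        final String annotationId; // value of a:id or a:ref
        final boolean isRef;       // false for an a:id part, true for an a:ref part
        final int start;           // position where the element opens
        final int end;             // position where the element closes

        Part(String annotationId, boolean isRef, int start, int end) {
            this.annotationId = annotationId;
            this.isRef = isRef;
            this.start = start;
            this.end = end;
        }
    }

    // Returns {start, end} per annotation ID, in order of first appearance.
    static Map<String, int[]> rebuild(Part... parts) {
        Map<String, int[]> ranges = new LinkedHashMap<>();
        for (Part part : parts) {
            int[] range = ranges.get(part.annotationId);
            if (range == null) {
                if (part.isRef) { // assumption: the id part always comes first
                    throw new IllegalStateException("a:ref before a:id: " + part.annotationId);
                }
                ranges.put(part.annotationId, new int[] {part.start, part.end});
            } else {
                // A later part of the same annotation widens the known range.
                range[0] = Math.min(range[0], part.start);
                range[1] = Math.max(range[1], part.end);
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        Map<String, int[]> ranges = rebuild(
            new Part("a1", false, 2, 6),  // a:id='a1'
            new Part("a2", false, 4, 6),  // a:id='a2', closed together with a1
            new Part("a2", true, 6, 8));  // a:ref='a2', the reopened part
        for (Map.Entry<String, int[]> e : ranges.entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue()[0] + "-" + e.getValue()[1]);
        }
    }
}
```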

Is there a namespace defined for the XML annotations?

Yes, a namespace is defined for the annotations, and the namespace is also dereferenceable at http://www.speexx.de/ocean/ns/annotation/1.0/#.

How are namespaces serialized?

This is quite a simple mechanism. For each namespace the serializer defines a prefix, which is used in the XML document. The namespace-to-prefix mapping is declared for all namespaces with namespace attributes (xmlns:xyz) on the root element.
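
A minimal Java sketch of such a mapping is shown below. The class PrefixSketch, the generated prefix names ns0, ns1, and so on, and the example.org namespaces are assumptions; the prefixes the default serializer actually generates may look different.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a namespace-to-prefix mapping for the root element.
final class PrefixSketch {
    // Assigns one generated prefix per distinct namespace URI, in order of appearance.
    static Map<String, String> assignPrefixes(Iterable<String> namespaces) {
        Map<String, String> prefixes = new LinkedHashMap<>();
        for (String namespace : namespaces) {
            if (!prefixes.containsKey(namespace)) {
                prefixes.put(namespace, "ns" + prefixes.size());
            }
        }
        return prefixes;
    }

    // Builds the xmlns:... attributes for the root element.
    // Assumes the URIs contain no characters that need XML escaping.
    static String rootDeclarations(Map<String, String> prefixes) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : prefixes.entrySet()) {
            out.append(" xmlns:").append(e.getValue())
               .append("=\"").append(e.getKey()).append('"');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> prefixes = assignPrefixes(Arrays.asList(
            "http://example.org/a#", "http://example.org/b#", "http://example.org/a#"));
        System.out.println(rootDeclarations(prefixes));
    }
}
```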

Why are there failing tests with JDK 1.4.x from Sun?

At this time I don't know.

The tests work fine with JDK 1.5 and with the Java SDK 1.4.2 from IBM, both on Linux.