With the Hadoop streaming API it is possible to read XML input. The εοs-toolkit uses the streaming API to convert Medline documents into EosDocuments on a cluster. The streaming API is not easy to use, though, for two reasons.
First, the implementation (Hadoop 0.16.2) does not pass the correct record to the mapper process. I have to adjust the record by removing a stray opening or closing tag of the parent XML element, if one is present.
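The cleanup described above could look roughly like the following. This is only a minimal sketch and not the code from the εοs-toolkit: the class RecordCleaner and the method stripParentTags are hypothetical names, and it assumes that the parent element is MedlineCitationSet and that the record arrives as a plain string.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop or the eos-toolkit.
class RecordCleaner {

    // Removes every opening or closing tag of the given parent element
    // from the record and trims the surrounding whitespace.
    static String stripParentTags(String record, String parentElement) {
        // Matches <Parent>, <Parent attr="...">, and </Parent>,
        // but not elements whose name merely starts with a prefix of it.
        String tag = "</?" + Pattern.quote(parentElement) + "(\\s[^>]*)?>";
        return record.replaceAll(tag, "").trim();
    }

    public static void main(String[] args) {
        // Assumed example: a record that still carries the closing parent tag.
        String raw = "</MedlineCitationSet>\n"
                + "<MedlineCitation Status=\"MEDLINE\"><PMID>1</PMID></MedlineCitation>";
        System.out.println(stripParentTags(raw, "MedlineCitationSet"));
    }
}
```

A regular expression is enough here because only a single, known tag has to be removed; the record itself is not parsed.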
Second, the mapper does not receive the record in the value parameter of the map method. The record arrives in the key parameter instead. I could not find any documentation of this behavior in the Hadoop docs; only a look at the code of StreamXmlRecordReader reveals it.
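Inside a map method, the key/value swap can be handled with a small guard. The sketch below is plain Java without the Hadoop classes, so it only illustrates the decision itself: the class RecordSource and the method recordFrom are hypothetical, and it assumes that key and value have already been converted to strings and that the value is empty when the record sits in the key.

```java
// Hypothetical helper, not part of Hadoop or the eos-toolkit.
class RecordSource {

    // Returns the XML record from a (key, value) pair as delivered to map().
    // Assumption: with StreamXmlRecordReader the record is in the key and
    // the value is empty; other readers put the record in the value.
    static String recordFrom(String key, String value) {
        if (value == null || value.isEmpty()) {
            return key;
        }
        return value;
    }

    public static void main(String[] args) {
        // Assumed example pair as the XML record reader would deliver it.
        String key = "<MedlineCitation><PMID>1</PMID></MedlineCitation>";
        String value = "";
        System.out.println(recordFrom(key, value));
    }
}
```

Falling back to the value when it is not empty keeps the mapper usable if the input reader is exchanged later.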
See the εοs-toolkit converter contribution for more information.
Posted by sascha on 14 April 2008, 00:11:23 CEST
