|
http://purl.bdrc.io/ontology/admin/technicalComment
|
The :Etext :Entity represents a searchable Unicode text file ("eText"). In BUDA, an eText is composed of a base layer, at least one archival annotation layer, and optionally other annotation layers for different uses. The base layer is a raw Unicode text file whose text and line breaks form a coordinate system. The coordinate system includes a sequence number and the positions of the start and end characters. The coordinate system is defined in the archival annotation layer.
The RDF representation of :Etext is derived from the base layer, a graph defining the coordinate system (:EtextNonPaginated or :EtextPaginated), and a graph for the :ItemEtext that references the volume of the :ItemEtext.
There will also be information such as a URI to the original source file, if applicable. For Tibetan, these are typically Sambhota Word or Tibet Doc files.
The reference to the containing :ItemEtext is a URI formed by appending the volume index to the :ItemEtext URI, like:
:eTextInVolume bdr:I1KG443_003/volume/3
This URI refers to a volume blank node contained in the referenced Item.
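The URI pattern above can be sketched as a simple string operation. This is only an illustration: the function name `volume_ref` is hypothetical, and the `/volume/` path segment is taken from the single example shown rather than from a documented rule.

```python
def volume_ref(item_uri: str, volume_index: int) -> str:
    """Build a volume reference by appending the volume index to an :ItemEtext URI.

    Hypothetical helper; the "/volume/<n>" shape follows the example in the text.
    """
    return f"{item_uri}/volume/{volume_index}"


# Reproduces the example value used with :eTextInVolume.
example = volume_ref("bdr:I1KG443_003", 3)
```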
Search chunks in the :Etext are derived from the original Unicode eText stream by breaking the stream at locations chosen to minimize the likelihood that likely search terms are split across chunk boundaries. In Tibetan this is achieved by breaking the stream after the nearest shad codepoint following roughly 300 tsegs. This assumes that most searches will not extend beyond a phrase boundary delimited by a shad (note that a family of codepoints counts as shad in the Tibetan Unicode block).
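A minimal sketch of this chunking rule follows, under stated assumptions rather than as the actual BUDA implementation. The function name `chunk_text`, the choice of U+0F0D through U+0F12 as the shad family, and the decision not to break in the middle of a run of consecutive shads are all assumptions made for illustration.

```python
TSHEG = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG

# Assumed shad family (U+0F0D..U+0F12); the text does not list the exact set.
SHADS = frozenset(chr(cp) for cp in range(0x0F0D, 0x0F13))


def chunk_text(text: str, tsegs_per_chunk: int = 300) -> list[str]:
    """Split text after the first shad that follows `tsegs_per_chunk` tsegs."""
    chunks = []
    start = 0   # index where the current chunk begins
    tsegs = 0   # tsegs seen since the current chunk began
    for i, ch in enumerate(text):
        if ch == TSHEG:
            tsegs += 1
        elif ch in SHADS and tsegs >= tsegs_per_chunk:
            # Assumption: keep a run of shads (e.g. a doubled shad) together.
            next_is_shad = i + 1 < len(text) and text[i + 1] in SHADS
            if not next_is_shad:
                chunks.append(text[start:i + 1])
                start = i + 1
                tsegs = 0
    if start < len(text):
        chunks.append(text[start:])  # trailing text with no closing shad
    return chunks
```

Because every character ends up in exactly one chunk, concatenating the chunks in order always reproduces the input stream.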
Each chunk is a blank node containing the chunk id, a virtual id that serves only to order the chunks so that the original eText content can be reassembled, like:
[ a :EtextChunk ;
:eTextChunk 3 ;
:chunkContents ". . . the text contained in chunk 3 . . ." ]
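Since the chunk id exists only to order the chunks, reassembling the eText amounts to sorting by id and concatenating. The sketch below assumes the chunks have already been read from the graph into `(chunk_id, contents)` pairs; that in-memory shape and the name `reassemble` are illustrative, not part of the ontology.

```python
def reassemble(chunks):
    """Rebuild the eText from (chunk_id, contents) pairs given in any order."""
    ordered = sorted(chunks, key=lambda pair: pair[0])  # order by chunk id
    return "".join(contents for _, contents in ordered)


# Chunks may arrive unordered from a graph query.
text = reassemble([(2, "second "), (1, "first "), (3, "third")])
```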
If the chunk is from an OCR'd text, then there is additional information indicating the starting chunk and character, and ending chunk and character for each page in the :ItemImageAsset from which the OCR'd material was derived. For example,
[ a :EtextPage ; :slice 46 ;
:sliceStartChunk 43 ;
:sliceStartChar 17 ;
:sliceEndChunk 44 ;
:sliceEndChar 138 ]
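As an illustration of how such a page slice might be resolved to text, the sketch below extracts the span between a start chunk/character and an end chunk/character. It rests on assumptions the text does not state: character offsets are treated as 0-based codepoint offsets with an exclusive end, chunk ids are consecutive integers, and chunks are held in a dictionary keyed by id. The name `page_text` is hypothetical.

```python
def page_text(chunks, start_chunk, start_char, end_chunk, end_char):
    """Return the text of one page slice.

    chunks: dict mapping chunk id -> chunk contents (str).
    Assumed convention: 0-based codepoint offsets, end offset exclusive.
    """
    if start_chunk == end_chunk:
        return chunks[start_chunk][start_char:end_char]
    parts = [chunks[start_chunk][start_char:]]      # tail of the first chunk
    for chunk_id in range(start_chunk + 1, end_chunk):
        parts.append(chunks[chunk_id])              # whole chunks in between
    parts.append(chunks[end_chunk][:end_char])      # head of the last chunk
    return "".join(parts)
```

If the real offsets turn out to be 1-based or end-inclusive, only the slice bounds need adjusting; the chunk-walking logic stays the same.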
The character counts are expressed in Unicode codepoints, regardless of the encoding used to store the text (typically UTF-8).
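The distinction matters for Tibetan because each codepoint in the Tibetan block occupies three bytes in UTF-8, so codepoint offsets and byte offsets diverge quickly. The short sketch below shows the difference; the helper `byte_offset` is a hypothetical name introduced only for illustration.

```python
sample = "\u0f56\u0f7c\u0f51\u0f0b"  # four Tibetan codepoints ending in a tsheg

codepoint_count = len(sample)                # Python str length counts codepoints
utf8_byte_count = len(sample.encode("utf-8"))  # bytes in the UTF-8 encoding


def byte_offset(text: str, codepoint_offset: int) -> int:
    """Convert a codepoint offset into the matching UTF-8 byte offset."""
    return len(text[:codepoint_offset].encode("utf-8"))
```

Here `codepoint_count` is 4 while `utf8_byte_count` is 12, so a consumer working on raw UTF-8 bytes would need a conversion such as `byte_offset` before applying the slice offsets.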
|