module ULB: sig .. end
This module provides the unicode_lexbuf record with
access functions. In this record, the data is available
in two forms: as an array of Unicode code points,
ulb_chars, and as a string of encoded characters, ulb_rawbuf.
The two buffers are synchronised by ulb_chars_pos: this
array stores where every character of ulb_chars can be
found in ulb_rawbuf.

type unicode_lexbuf = private {
  mutable ulb_encoding :
    (* The character encoding of ulb_rawbuf *)
  mutable ulb_encoding_start :
    (* The first character position to which ulb_encoding applies
       (the encoding of earlier positions is lost) *)
  mutable ulb_rawbuf :
    (* The encoded string to analyse *)
  mutable ulb_rawbuf_len :
    (* The filled part of ulb_rawbuf *)
  mutable ulb_rawbuf_end :
    (* The analysed part of ulb_rawbuf. It always holds that
       ulb_rawbuf_end <= ulb_rawbuf_len. The analysed part
       may be shorter than the filled part because there is
       not enough space in ulb_chars, or because the filled
       part ends with an incomplete multi-byte character *)
  mutable ulb_rawbuf_const :
    (* Whether ulb_rawbuf is treated as a constant. If
       true, it is never blitted. *)
  mutable ulb_chars :
    (* The analysed part of ulb_rawbuf as an array of Unicode
       code points. Only the positions 0 to ulb_chars_len-1
       of the array are filled. *)
  mutable ulb_chars_pos :
    (* For every analysed character, this array stores the
       byte position where the character begins in ulb_rawbuf.
       In addition, the array contains at index ulb_chars_len the
       value of ulb_rawbuf_end.
       This array is one element longer than ulb_chars. *)
  mutable ulb_chars_len :
    (* The filled part of ulb_chars *)
  mutable ulb_eof :
    (* Whether EOF has been seen *)
  mutable ulb_refill :
    (* The refill function *)
  mutable ulb_enc_change_hook :
    (* This function is called when the encoding changes *)
  mutable ulb_cursor :
    (* Internally used by the implementation *)
}
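
The relationship between the three buffers can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: the type toy_buf and the functions utf8_seq_len and decode are hypothetical names, not part of ULB; the input is assumed to be complete, well-formed UTF-8; and everything is decoded in one step, so the analysed part equals the filled part.

```ocaml
(* Toy model only: these names are hypothetical and not part of ULB. *)
type toy_buf = {
  rawbuf : string;        (* encoded bytes, here always UTF-8 *)
  chars : int array;      (* decoded code points *)
  chars_pos : int array;  (* byte offset of each code point, plus an end marker *)
  chars_len : int;        (* number of decoded code points *)
}

(* Length in bytes of the UTF-8 sequence introduced by lead byte [b].
   Assumes well-formed input; no validation is attempted. *)
let utf8_seq_len b =
  if b < 0x80 then 1
  else if b < 0xE0 then 2
  else if b < 0xF0 then 3
  else 4

let decode (s : string) : toy_buf =
  let n = String.length s in
  let chars = Array.make n 0 in
  (* One element longer than [chars], to hold the end marker. *)
  let chars_pos = Array.make (n + 1) 0 in
  let rec loop i k =
    if i >= n then k
    else begin
      let b = Char.code s.[i] in
      let len = utf8_seq_len b in
      (* Payload bits of the lead byte. *)
      let init = if len = 1 then b else b land (0xFF lsr (len + 1)) in
      let cp = ref init in
      for j = 1 to len - 1 do
        cp := (!cp lsl 6) lor (Char.code s.[i + j] land 0x3F)
      done;
      chars.(k) <- !cp;
      chars_pos.(k) <- i;
      loop (i + len) (k + 1)
    end
  in
  let k = loop 0 0 in
  (* The entry at index chars_len stores the end of the analysed bytes. *)
  chars_pos.(k) <- n;
  { rawbuf = s; chars; chars_pos; chars_len = k }
```

For the six-byte input "a\xC3\xA9\xE2\x82\xAC" (the characters a, é, €), this model yields three code points whose byte positions are 0, 1 and 3, followed by the end marker 6.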

val from_function : ?raw_size:int ->
?char_size:int ->
?enc_change_hook:(unicode_lexbuf -> unit) ->
refill:(Bytes.t -> int -> int -> int) ->
Netconversion.encoding -> unicode_lexbuf
Creates a unicode_lexbuf to analyse strings of the
passed encoding coming from the refill function.
raw_size : The initial size for ulb_rawbuf. Defaults to 512.
char_size : The initial size for ulb_chars. Defaults to 256.
enc_change_hook : This function is called when the encoding
is changed, either by this module, or by the user calling
set_encoding.
refill : This function is called with the arguments ulb_rawbuf,
ulb_rawbuf_len, and l, where
l = Bytes.length ulb_rawbuf - ulb_rawbuf_len is the free
space in the buffer. The function should write new bytes into
this free region and return the number of bytes added. A
return value of 0 signals EOF.

val from_in_obj_channel : ?raw_size:int ->
?char_size:int ->
?enc_change_hook:(unicode_lexbuf -> unit) ->
Netconversion.encoding ->
Netchannels.in_obj_channel -> unicode_lexbuf
Creates a unicode_lexbuf to analyse strings of the
passed encoding coming from the object channel.
raw_size : The initial size for ulb_rawbuf. Defaults to 512.
char_size : The initial size for ulb_chars. Defaults to 256.
enc_change_hook : This function is called when the encoding
is changed, either by this module, or by the user calling
set_encoding.

val from_string : ?enc_change_hook:(unicode_lexbuf -> unit) ->
Netconversion.encoding -> string -> unicode_lexbuf
Creates a unicode_lexbuf analysing the passed string encoded in
the passed encoding. This function copies the input string.
enc_change_hook : This function is called when the encoding
is changed, either by this module, or by the user calling
set_encoding.

val from_bytes : ?enc_change_hook:(unicode_lexbuf -> unit) ->
Netconversion.encoding -> Bytes.t -> unicode_lexbuf

val from_bytes_inplace : ?enc_change_hook:(unicode_lexbuf -> unit) ->
Netconversion.encoding -> Bytes.t -> unicode_lexbuf
Creates a unicode_lexbuf analysing the passed string encoded in
the passed encoding. This function does not copy the input string,
but uses it directly as ulb_rawbuf. The string is not modified by ULB,
but the caller must ensure that other program parts do not
modify it either.
enc_change_hook : This function is called when the encoding
is changed, either by this module, or by the user calling
set_encoding.
Regarding from_string_inplace: this function has been removed,
as strings are now considered immutable.

val delete : int -> unicode_lexbuf -> unit
Deletes a number of characters from the unicode_lexbuf.
These characters
are removed from the beginning of the buffer, i.e.
ulb_chars.(n) becomes the new first character of the
buffer. All three buffers ulb_rawbuf, ulb_chars, and
ulb_chars_pos are blitted as necessary.
When the buffer is already at EOF, the function fails.
For efficiency, call delete as seldom as possible: its
running time is linear in the number of characters that must be moved.
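
The blitting described above can be sketched on a simplified stand-alone model. The record toy_buf and the function toy_delete are hypothetical names, not the ULB implementation; the model assumes that the whole filled part has been analysed, and it does not reproduce the EOF behaviour of delete.

```ocaml
(* Toy model only: field and function names are hypothetical, not the ULB API. *)
type toy_buf = {
  rawbuf : Bytes.t;            (* encoded bytes *)
  mutable rawbuf_len : int;    (* filled (and, in this model, analysed) bytes *)
  chars : int array;           (* decoded code points *)
  chars_pos : int array;       (* one element longer than [chars] *)
  mutable chars_len : int;     (* number of decoded code points *)
}

let toy_delete n b =
  if n < 0 || n > b.chars_len then invalid_arg "toy_delete";
  (* Byte offset of the first character that survives. *)
  let off = b.chars_pos.(n) in
  (* Move the remaining raw bytes to the front. *)
  Bytes.blit b.rawbuf off b.rawbuf 0 (b.rawbuf_len - off);
  b.rawbuf_len <- b.rawbuf_len - off;
  (* Move the remaining code points to the front. *)
  Array.blit b.chars n b.chars 0 (b.chars_len - n);
  (* Shift the positions, including the end marker, and rebase them
     so that the new first character starts at byte 0. *)
  for i = 0 to b.chars_len - n do
    b.chars_pos.(i) <- b.chars_pos.(i + n) - off
  done;
  b.chars_len <- b.chars_len - n
```

Every surviving byte, code point, and position is moved once, which is why the cost grows with the amount of data left in the buffer rather than with n.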

val refill : unicode_lexbuf -> unit
Tries to add characters to the unicode_lexbuf by calling the
ulb_refill function. When the buffer is already at EOF, the
exception End_of_file is raised, and the buffer is not modified.
Otherwise, the ulb_refill function is called to
add new characters. If necessary, ulb_rawbuf, ulb_chars, and
ulb_chars_pos are enlarged so that either
at least one new character is added, or EOF is found for
the first time.
In the latter case, ulb_eof is set to true (and the next call
of refill will raise End_of_file).

val set_encoding : Netconversion.encoding -> unicode_lexbuf -> unit
Sets the encoding to the passed value. This only affects future
refill calls. The hook enc_change_hook is invoked when defined.

val close : unicode_lexbuf -> unit
Sets ulb_eof of the unicode_lexbuf. The rest of the buffer
is not modified.

val utf8_sub_string : int -> int -> unicode_lexbuf -> string
The two int arguments are the position and length of a
substring of the lexbuf that is returned as a UTF-8 string. Position
and length are given in characters, not bytes.

val utf8_sub_string_length : int -> int -> unicode_lexbuf -> int
Returns String.length(utf8_sub_string args). Tries not to
allocate the UTF-8 string.
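
One way to obtain such a length without building the string is to sum the encoded width of each code point. The sketch below works on a plain int array rather than on a unicode_lexbuf; utf8_width and utf8_length_of_sub are hypothetical helpers, and whether ULB computes the length this way is not stated here.

```ocaml
(* Number of bytes UTF-8 needs for one code point.
   Assumes [cp] is a valid Unicode scalar value. *)
let utf8_width cp =
  if cp < 0x80 then 1
  else if cp < 0x800 then 2
  else if cp < 0x10000 then 3
  else 4

(* Hypothetical helper: byte length of the UTF-8 encoding of
   chars.(pos) .. chars.(pos + len - 1), computed without
   allocating the encoded string. *)
let utf8_length_of_sub (chars : int array) pos len =
  if pos < 0 || len < 0 || pos + len > Array.length chars then
    invalid_arg "utf8_length_of_sub";
  let total = ref 0 in
  for i = pos to pos + len - 1 do
    total := !total + utf8_width chars.(i)
  done;
  !total
```

As with utf8_sub_string, pos and len count characters here, while the result counts bytes.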