C4::Charset - utilities for handling character set conversions.
use C4::Charset;
This module contains routines for dealing with character set conversions, particularly for MARC records.
A variety of character encodings are in use by various MARC standards, and even more character encodings are used by non-standard MARC records. The various MARC formats generally do not do a good job of advertising a given record's character encoding, and even when a record does advertise its encoding, e.g., via the Leader/09, experience has shown that one cannot trust it.
Ultimately, all MARC records are stored in Koha in UTF-8 and must be converted from whatever the source character encoding is. The goal of this module is to ensure that these conversions take place accurately. When a character conversion cannot take place, or at least not accurately, the module will provide enough information to allow user-facing code to tell the user how to deal with the situation.
my $is_utf8 = IsStringUTF8ish($str);
Determines if $str is valid UTF-8. This can mean one of two things: either the Perl UTF-8 flag is set and the string contains valid UTF-8, or the flag is not set but the octets in $str are valid when parsed as UTF-8.
The function is named IsStringUTF8ish instead of IsStringUTF8 because one could be presented with a MARC blob that is not actually in UTF-8 but whose sequence of octets appears to be valid UTF-8. The rest of the MARC character conversion functions assume that this situation does not occur very often.
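The "ish" caveat can be illustrated with core Perl alone. The following is a minimal sketch, not the module's actual implementation; looks_like_utf8 is a hypothetical helper that uses a strict Encode decode to test raw octets.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for IsStringUTF8ish, not the module's actual code.
sub looks_like_utf8 {
    my ($str) = @_;
    # Case 1: Perl already treats the string as decoded characters.
    return 1 if Encode::is_utf8($str);
    # Case 2: raw octets; attempt a strict decode on a copy, because
    # decode() may modify its input when a CHECK value is given.
    my $copy = $str;
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

my $latin1_e_acute = "\xE9";        # "é" in ISO-8859-1: not valid UTF-8
my $ambiguous      = "\xC3\xA9";    # two ISO-8859-1 characters, but also "é" in UTF-8

print looks_like_utf8($latin1_e_acute) ? "utf8ish\n" : "not utf8ish\n";
print looks_like_utf8($ambiguous)      ? "utf8ish\n" : "not utf8ish\n";
```

The second string passes the check even if its author meant it as ISO-8859-1, which is exactly the ambiguity the function name warns about.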
my $marc_record = SetUTF8Flag($marc_record, $nfd);
This function sets the Perl UTF-8 flag on the record's data. It is required when using new_from_usmarc, since MARC::File::USMARC does not set the Perl UTF-8 flag itself. Without it, editing the fields and subfields of Unicode MARC records would result in double encoding.
If $nfd is set, string normalization will use NFD instead of NFC.
FIXME: In my opinion, this function belongs in MARC::Record rather than in this package. But since it deals with both character sets and MARC::Record, it has ended up here.
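The double-encoding problem mentioned above can be reproduced on a plain string with the core Encode module. This is only a sketch of the underlying Perl behaviour; it does not use MARC::Record or SetUTF8Flag, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode is_utf8);

my $octets = "\xC3\xA9";                 # UTF-8 octets for "é", flag not set

# Without decoding, Perl sees two Latin-1 characters and encodes each again.
my $double = encode('UTF-8', $octets);   # 4 octets: C3 83 C2 A9

# Decoding first yields one character with the UTF-8 flag set.
my $chars  = decode('UTF-8', $octets);
my $single = encode('UTF-8', $chars);    # 2 octets: C3 A9

printf "flag before: %d, flag after: %d\n",
    (is_utf8($octets) ? 1 : 0), (is_utf8($chars) ? 1 : 0);
printf "double-encoded length: %d, round-trip length: %d\n",
    length $double, length $single;
```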
my $normalized_string=NormalizeString($string,$nfd,$transform);
Given a string and two optional flags:
$nfd: set this if you want NFD rather than NFC normalization.
$transform: set this if you expect all the signs (punctuation marks) to be removed.
Sets the Perl UTF-8 flag on your initial data if need be and applies cleaning if required.
Returns a UTF-8 string in NFC form (or NFD form if $nfd is set).
Sample code:
my $string = NormalizeString("l'ornithoptère"); # results in ornithoptère in NFC form and sets the UTF-8 flag
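The difference between the two normalization forms can be seen with the core Unicode::Normalize module. This is a small sketch of NFC versus NFD on a single accented letter; it does not call NormalizeString itself.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{E8}";        # "è" as one code point (U+00E8)
my $decomposed = "e\x{300}";      # "e" followed by combining grave (U+0300)

my $nfc = NFC($decomposed);       # recomposes to U+00E8
my $nfd = NFD($composed);         # decomposes to U+0065 U+0300

printf "NFC length: %d, NFD length: %d\n", length $nfc, length $nfd;
print "same text after normalizing\n" if NFC($composed) eq NFC($decomposed);
```

Normalizing both sides to the same form before comparing is what makes visually identical strings compare equal.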
($marc_record, $converted_from, $errors_arrayref) = MarcToUTF8Record($marc_blob, $marc_flavour[, $source_encoding]);
Given a MARC blob or a MARC::Record, the MARC flavour, and an optional source encoding, return a MARC::Record that is converted to UTF-8.
The returned $marc_record is guaranteed to be in valid UTF-8, but is not guaranteed to have been converted correctly. Specifically, if $converted_from is 'failed', the MARC record returned failed character conversion and had each of its non-ASCII octets changed to the Unicode replacement character.
If the source encoding was not specified, this routine will try to guess it; the character encoding used for a successful conversion is returned in $converted_from.
SetMarcUnicodeFlag($marc_record, $marc_flavour);
Set both the internal MARC::Record encoding flag and the appropriate Leader/09 (MARC21) or 100/26-29 (UNIMARC) to indicate that the record is in UTF-8. Note that this does not do any actual character conversion.
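For MARC21, the Leader/09 part of this amounts to writing a single character. The sketch below works on a bare leader string rather than a MARC::Record; mark_leader_as_unicode is a hypothetical helper and the sample leader is illustrative. In MARC21, Leader/09 is 'a' for UCS/Unicode and blank for MARC-8.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of a 24-character MARC21 leader
# with position 09 (character coding scheme) set to 'a' (UCS/Unicode).
sub mark_leader_as_unicode {
    my ($leader) = @_;
    die "leader must be 24 characters\n" unless length($leader) == 24;
    substr($leader, 9, 1) = 'a';
    return $leader;
}

my $leader = '00000nam  2200000 a 4500';   # illustrative leader; Leader/09 is blank
print mark_leader_as_unicode($leader), "\n";
```

As with the real function, this only changes the advertised encoding; the record data itself is untouched.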
my $new_str = StripNonXmlChars($old_str);
Given a string, return a copy with the characters that are illegal in XML removed.
This function exists to work around a problem that can occur with badly-encoded MARC records. Specifically, if a UTF-8 MARC record also has escape (\x1b) characters, MARC::File::XML will let the escape characters pass through when as_xml() or as_xml_record() is called. The problem is that the escape character is not legal in well-formed XML documents, so when MARC::File::XML attempts to parse such a record, the XML parser will fail.
Stripping such characters will allow a MARC::Record->new_from_xml() to work, at the possible risk of some data loss.
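One way to implement such a filter is a single substitution that keeps only the characters allowed by the XML 1.0 "Char" production. This is a sketch under that assumption; strip_non_xml_chars is a hypothetical stand-in and may differ from what the module actually does.

```perl
use strict;
use warnings;

# Hypothetical stand-in: keeps only characters allowed by the XML 1.0
# "Char" production (tab, LF, CR, and the three legal ranges).
sub strip_non_xml_chars {
    my ($str) = @_;
    return $str unless defined $str && length $str;
    $str =~ s/[^\x{09}\x{0A}\x{0D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $str;
}

my $dirty = "Title\x1b with escape";
print strip_non_xml_chars($dirty), "\n";   # the \x1b character is dropped
```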
nsb_clean($string);
Removes Non-Sorting Block characters.
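As a rough illustration, the sketch below strips the delimiters that mark a non-sorting span such as a leading article. It assumes the delimiters are the control characters 0x88/0x89 and 0x98/0x9C; that set is an assumption made for this example, and strip_nsb is a hypothetical helper, not the module's code.

```perl
use strict;
use warnings;

# Assumption: the begin/end non-sorting delimiters are the control
# characters U+0088/U+0089 and U+0098/U+009C. The module may target
# a different set.
sub strip_nsb {
    my ($string) = @_;
    $string =~ s/[\x{88}\x{89}\x{98}\x{9C}]//g;
    return $string;
}

my $title = "\x{88}The \x{89}Hobbit";
print strip_nsb($title), "\n";
```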
SanitizeRecord($marcrecord);
Sanitizes a record. This routine is called in the maintenance script misc/maintenance/sanitize_records.pl. It cleans any string containing a repeatedly escaped ampersand ('&amp;amp;...'), replacing it by '&'.
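The cleanup can be pictured as one substitution on a field value. This sketch assumes the strings being targeted are ampersands that have been escaped one or more times ('&amp;', '&amp;amp;', and so on); collapse_ampersands is a hypothetical helper that works on a plain string rather than a MARC::Record.

```perl
use strict;
use warnings;

# Hypothetical helper: collapses an ampersand followed by one or more
# "amp;" sequences into a single bare ampersand.
sub collapse_ampersands {
    my ($value) = @_;
    $value =~ s/&(?:amp;)+/&/g;
    return $value;
}

print collapse_ampersands('Smith &amp;amp; Sons'), "\n";
```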
my ($new_marc_record, $guessed_charset) = _default_marc21_charconv_to_utf8($marc_record);
Converts a MARC::Record of unknown character set to UTF-8, first by trying a MARC-8 to UTF-8 conversion, then ISO-8859-1 to UTF-8, then a default conversion that replaces each non-ASCII character with the replacement character.
The $guessed_charset return value contains the character set that resulted in a conversion to valid UTF-8; note that if the MARC-8 and ISO-8859-1 conversions failed, the value of this is 'failed'.
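The try-in-order strategy described above can be sketched on a plain octet string with the core Encode module. The candidates used here ('ascii', 'UTF-8') are stand-ins chosen because Encode supports them out of the box; MARC-8 is not among Encode's built-in encodings, and guess_and_decode is a hypothetical helper, not part of this module.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: tries each candidate encoding strictly, in order,
# and falls back to replacement characters when none of them fits.
sub guess_and_decode {
    my ($octets, @candidates) = @_;
    for my $encoding (@candidates) {
        my $copy = $octets;    # decode() may modify its input when CHECK is set
        my $text = eval { Encode::decode($encoding, $copy, Encode::FB_CROAK()) };
        return ($text, $encoding) if defined $text;
    }
    # Last resort: every high-bit octet becomes U+FFFD.
    (my $text = $octets) =~ s/[\x80-\xFF]/\x{FFFD}/g;
    return ($text, 'failed');
}

my ($text, $guessed) = guess_and_decode("caf\xC3\xA9", 'ascii', 'UTF-8');
print "converted from: $guessed\n";
```

Returning the name of the encoding that succeeded, or 'failed', lets the caller decide whether to accept the result or ask the user to pick another encoding.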
my ($new_marc_record, $guessed_charset) = _default_unimarc_charconv_to_utf8($marc_record);
Converts a MARC::Record of unknown character set to UTF-8, first by trying an ISO-5426 to UTF-8 conversion, then ISO-8859-1 to UTF-8, then a default conversion that replaces each non-ASCII character with the replacement character.
The $guessed_charset return value contains the character set that resulted in a conversion to valid UTF-8; note that if the ISO-5426 and ISO-8859-1 conversions failed, the value of this is 'failed'.
my @errors = _marc_marc8_to_utf8($marc_record, $marc_flavour, $source_encoding);
Convert a MARC::Record to UTF-8 in-place from MARC-8. If the conversion fails for some reason, appropriate messages will be placed in the returned @errors array.
my @errors = _marc_iso5426_to_utf8($marc_record, $marc_flavour, $source_encoding);
Convert a MARC::Record to UTF-8 in-place from ISO-5426. If the conversion fails for some reason, appropriate messages will be placed in the returned @errors array.
FIXME - is ISO-5426 equivalent enough to MARC-8 that MARC::Charset can be used instead?
my @errors = _marc_to_utf8_via_text_iconv($marc_record, $marc_flavour, $source_encoding);
Convert a MARC::Record to UTF-8 in-place using the Text::Iconv CPAN module. Any source encoding accepted by the user's iconv installation should work. If the source encoding is not recognized on the user's server or the conversion fails for some reason, appropriate messages will be placed in the returned @errors array.
_marc_to_utf8_replacement_char($marc_record, $marc_flavour);
Convert a MARC::Record to UTF-8 in-place, adopting the unsatisfactory method of replacing every non-ASCII octet (i.e., one where the eighth bit is set) with the Unicode replacement character. This is meant as a last-ditch method, and would be best used as part of a UI that lets a cataloguer try various character conversions until they find the right one.
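On a single string, this last-ditch conversion is a one-line substitution. The sketch below shows why it is lossy: two different source characters become indistinguishable. replace_high_octets is a hypothetical helper operating on a plain octet string, not on a MARC::Record.

```perl
use strict;
use warnings;

# Hypothetical helper: every octet with the eighth bit set becomes U+FFFD;
# ASCII octets pass through unchanged.
sub replace_high_octets {
    my ($octets) = @_;
    (my $text = $octets) =~ s/[\x80-\xFF]/\x{FFFD}/g;
    return $text;
}

my $out   = replace_high_octets("caf\xE9 cr\xE8me");
my $count = () = $out =~ /\x{FFFD}/g;
printf "%d replacement characters\n", $count;
```

Because both 0xE9 and 0xE8 map to the same replacement character, the original text cannot be recovered afterwards.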
my $utf8string = char_decode5426($iso_5426_string);
Converts a string from ISO-5426 to UTF-8.
Koha Development Team <http://koha-community.org/>
Galen Charlton <galen.charlton@liblime.com>