NAME

C4::Charset - utilities for handling character set conversions.

SYNOPSIS

  use C4::Charset;

DESCRIPTION

This module contains routines for dealing with character set conversions, particularly for MARC records.

A variety of character encodings are in use by various MARC standards, and even more character encodings are used by non-standard MARC records. The various MARC formats generally do not do a good job of advertising a given record's character encoding, and even when a record does advertise its encoding, e.g., via the Leader/09, experience has shown that one cannot trust it.

Ultimately, all MARC records are stored in Koha in UTF-8 and must be converted from whatever the source character encoding is. The goal of this module is to ensure that these conversions take place accurately. When a character conversion cannot take place, or at least not accurately, the module will provide enough information to allow user-facing code to tell the user how to deal with the situation.

FUNCTIONS

IsStringUTF8ish

  my $is_utf8 = IsStringUTF8ish($str);

Determines if $str is valid UTF-8. This can mean one of two things:

The Perl UTF-8 flag is set and the string contains valid UTF-8.
The Perl UTF-8 flag is not set, but the octets contain valid UTF-8.

The function is named IsStringUTF8ish instead of IsStringUTF8 because one could be presented with a MARC blob that is not actually in UTF-8 but whose sequence of octets appears to be valid UTF-8. The rest of the MARC character conversion functions assume that this situation does not occur very often.
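
The two cases, and the reason for the "ish" in the name, can be illustrated with a minimal sketch built on the core Encode module. The helper name looks_like_utf8 is hypothetical, and this is not necessarily how IsStringUTF8ish itself is implemented.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper for illustration only; not the module's own code.
sub looks_like_utf8 {
    my ($str) = @_;
    return 0 unless defined $str;

    # Case 1: the Perl UTF-8 flag is set; check the internal representation.
    if (utf8::is_utf8($str)) {
        return utf8::valid($str) ? 1 : 0;
    }

    # Case 2: no flag; attempt a strict decode of a copy of the octets
    # (decode() with a CHECK argument may modify its input).
    my $copy = $str;
    my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}

# "\xC3\xA9" is "é" in UTF-8, but the same octets are "Ã©" in ISO-8859-1:
# the octets alone cannot say which one the cataloguer meant.
my $ambiguous = "\xC3\xA9";

# "été" in ISO-8859-1; these octets are malformed as UTF-8.
my $latin1 = "\xE9t\xE9";
```

Here $ambiguous passes the check even if it was really ISO-8859-1 data, which is exactly the situation the function name warns about.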

SetUTF8Flag

  my $marc_record = SetUTF8Flag($marc_record, $nfd);

This function sets the Perl UTF-8 flag on the record's data. It is required when using new_from_usmarc, since MARC::File::USMARC does not set the Perl UTF-8 flag. Without it, editing the fields and subfields of a Unicode MARC record would result in double encoding.

If $nfd is set, string normalization will use NFD instead of NFC.
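
The double-encoding problem can be reproduced with the core Encode module alone. This is a minimal sketch of the underlying Perl behaviour, not of SetUTF8Flag itself; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 octets for "é", as read from a raw MARC blob (UTF-8 flag off).
my $octets = "\xC3\xA9";

# Encoding without decoding first treats each octet as a Latin-1 character,
# so the two octets become four: this is double encoding.
my $double = encode('UTF-8', $octets);

# Decoding first yields a single character with the UTF-8 flag set ...
my $chars = decode('UTF-8', $octets);

# ... which encodes back to the original two octets.
my $single = encode('UTF-8', $chars);

printf "double: %d octets, single: %d octets\n",
    length($double), length($single);
```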

FIXME: In my opinion, this function belongs in MARC::Record and not in this package. But since it handles both character sets and MARC::Record, it has found its way into this package.

NormalizeString

  my $normalized_string = NormalizeString($string, $nfd, $transform);

Given a string, with two optional arguments:

nfd: set this if you want NFD instead of NFC.
transform: set this if you expect all the signs to be removed.

Sets the Perl UTF-8 flag on your initial data if need be, and applies cleaning if required.

Returns a normalized UTF-8 string (NFC unless $nfd is set).

Sample code:

  my $string = NormalizeString("l'ornithoptère");
  # results in ornithoptère in NFC form and sets the UTF-8 flag
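
The difference between the NFC and NFD forms can be seen with the core Unicode::Normalize module. This is only a sketch of the two normalization forms; it does not call NormalizeString and does not cover the transform option.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# "ornithoptère" with "è" as the single code point U+00E8 (composed form).
my $composed = "ornithopt\x{E8}re";

# NFD splits "è" into "e" followed by U+0300 COMBINING GRAVE ACCENT.
my $decomposed = NFD($composed);

# The two forms look the same on screen but are different strings;
# normalizing both to NFC makes them comparable again.
my $equal_raw = ($composed eq $decomposed)           ? 1 : 0;
my $equal_nfc = (NFC($composed) eq NFC($decomposed)) ? 1 : 0;
```

This is why a consistent normalization form matters when strings are compared or indexed.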

MarcToUTF8Record

  ($marc_record, $converted_from, $errors_arrayref) = MarcToUTF8Record($marc_blob,
                                        $marc_flavour [, $source_encoding]);

Given a MARC blob or a MARC::Record, the MARC flavour, and an optional source encoding, return a MARC::Record that is converted to UTF-8.

The returned $marc_record is guaranteed to be in valid UTF-8, but is not guaranteed to have been converted correctly. Specifically, if $converted_from is 'failed', the MARC record returned failed character conversion and had each of its non-ASCII octets changed to the Unicode replacement character.

If the source encoding was not specified, this routine will try to guess it; the character encoding used for a successful conversion is returned in $converted_from.

SetMarcUnicodeFlag

  SetMarcUnicodeFlag($marc_record, $marc_flavour);

Set both the internal MARC::Record encoding flag and the appropriate Leader/09 (MARC21) or 100/26-29 (UNIMARC) to indicate that the record is in UTF-8. Note that this does not do any actual character conversion.
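
For the MARC21 case, the Leader/09 change amounts to writing a single character into the 24-character leader. The sketch below works on a plain string with a hypothetical leader value; it does not touch the MARC::Record internal encoding flag and does not cover the UNIMARC 100 field.

```perl
use strict;
use warnings;

# Hypothetical 24-character MARC21 leader with a blank at position 09.
my $leader = '00714cam  2200205 a 4500';

# In MARC21, Leader/09 is 'a' for UCS/Unicode and blank for MARC-8.
substr($leader, 9, 1, 'a');
```

As with SetMarcUnicodeFlag, changing the flag only changes what the record claims about itself; the field data is left as it was.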

StripNonXmlChars

  my $new_str = StripNonXmlChars($old_str);

Given a string, return a copy with the characters that are illegal in XML removed.

This function exists to work around a problem that can occur with badly-encoded MARC records. Specifically, if a UTF-8 MARC record also has escape (\x1b) characters, MARC::File::XML will let the escape characters pass through when as_xml() or as_xml_record() is called. The problem is that the escape character is not legal in well-formed XML documents, so when MARC::File::XML attempts to parse such a record, the XML parser will fail.

Stripping such characters will allow a MARC::Record->new_from_xml() to work, at the possible risk of some data loss.
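
One way to strip such characters is to delete everything outside the Char production of XML 1.0. The following is a minimal sketch under that assumption; the helper name strip_non_xml_chars is hypothetical and the module's own pattern may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only characters allowed by XML 1.0, that is
# tab, LF, CR, U+0020..U+D7FF, U+E000..U+FFFD and U+10000..U+10FFFF.
sub strip_non_xml_chars {
    my ($str) = @_;
    $str =~ s/[^\x{09}\x{0A}\x{0D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $str;
}

# The escape character (\x1b) is removed; ordinary text and tabs survive.
my $clean = strip_non_xml_chars("Title\x1b with\tescape");
```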

nsb_clean

  nsb_clean($string);

Removes Non Sorting Block characters.
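
As an illustration, the sketch below removes non-sorting markers from a decoded character string. It assumes, and this is an assumption rather than something stated in this document, that the markers are the control characters U+0088/U+0089 and U+0098/U+009C; the helper name remove_nsb is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper. ASSUMPTION: the non-sorting begin/end markers are
# U+0088/U+0089 and U+0098/U+009C; check the module source for the real set.
sub remove_nsb {
    my ($str) = @_;
    $str =~ s/[\x{88}\x{89}\x{98}\x{9C}]//g;
    return $str;
}

# The leading article is wrapped in markers so that sorting can skip it;
# removing the markers leaves the displayable title.
my $title = remove_nsb("\x{98}The \x{9C}Hobbit");
```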

SanitizeRecord

  SanitizeRecord($marcrecord);

Sanitize a record. This routine is called in the maintenance script misc/maintenance/sanitize_records.pl. It cleans any string with '&amp;...', replacing it by '&'.
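
The kind of cleanup described here can be sketched on a single string. The helper name collapse_amp_entities and the exact pattern are assumptions made for illustration; the real routine works on a whole MARC::Record and may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse "&amp;", "&amp;amp;", and so on into "&".
# ASSUMPTION: this is the pattern meant by '&amp;...' in the description.
sub collapse_amp_entities {
    my ($value) = @_;
    $value =~ s/&(?:amp;)+/&/g;
    return $value;
}

my $fixed = collapse_amp_entities('Tom &amp;amp; Jerry');
```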

INTERNAL FUNCTIONS

_default_marc21_charconv_to_utf8

  my ($new_marc_record, $guessed_charset) = _default_marc21_charconv_to_utf8($marc_record);

Converts a MARC::Record of unknown character set to UTF-8, first by trying a MARC-8 to UTF-8 conversion, then ISO-8859-1 to UTF-8, then a default conversion that replaces each non-ASCII character with the replacement character.

The $guessed_charset return value contains the character set that resulted in a conversion to valid UTF-8; note that if the MARC-8 and ISO-8859-1 conversions failed, the value of this is 'failed'.
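
The try-in-order strategy described above can be sketched on a plain octet string with the core Encode module. The helper name convert_with_fallback is hypothetical, and because MARC-8 is not a core Encode encoding, the example uses 'ascii' and 'iso-8859-1' as stand-in candidates.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: try each candidate encoding in order and report
# which one produced a clean conversion, or 'failed' if none did.
sub convert_with_fallback {
    my ($octets, @candidates) = @_;
    for my $name (@candidates) {
        my $copy = $octets;    # decode() with CHECK may modify its input
        my $text = eval { decode($name, $copy, Encode::FB_CROAK()) };
        return ($text, $name) if defined $text;
    }
    # Last resort: replace every non-ASCII octet with U+FFFD.
    (my $text = $octets) =~ s/[\x80-\xFF]/\x{FFFD}/g;
    return ($text, 'failed');
}

my ($text, $guessed) = convert_with_fallback("caf\xE9", 'ascii', 'iso-8859-1');
my ($lossy, $status) = convert_with_fallback("caf\xE9", 'ascii');
```

Callers are expected to check the second return value, since a 'failed' result is still valid text but has lost the original non-ASCII characters.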

_default_unimarc_charconv_to_utf8

  my ($new_marc_record, $guessed_charset) = _default_unimarc_charconv_to_utf8($marc_record);

Converts a MARC::Record of unknown character set to UTF-8, first by trying an ISO-5426 to UTF-8 conversion, then ISO-8859-1 to UTF-8, then a default conversion that replaces each non-ASCII character with the replacement character.

The $guessed_charset return value contains the character set that resulted in a conversion to valid UTF-8; note that if the ISO-5426 and ISO-8859-1 conversions failed, the value of this is 'failed'.

_marc_marc8_to_utf8

  my @errors = _marc_marc8_to_utf8($marc_record, $marc_flavour, $source_encoding);

Convert a MARC::Record to UTF-8 in-place from MARC-8. If the conversion fails for some reason, appropriate messages will be placed in the returned @errors array.

_marc_iso5426_to_utf8

  my @errors = _marc_iso5426_to_utf8($marc_record, $marc_flavour, $source_encoding);

Convert a MARC::Record to UTF-8 in-place from ISO-5426. If the conversion fails for some reason, appropriate messages will be placed in the returned @errors array.

FIXME - is ISO-5426 equivalent enough to MARC-8 that MARC::Charset can be used instead?

_marc_to_utf8_via_text_iconv

  my @errors = _marc_to_utf8_via_text_iconv($marc_record, $marc_flavour, $source_encoding);

Convert a MARC::Record to UTF-8 in-place using the Text::Iconv CPAN module. Any source encoding accepted by the user's iconv installation should work. If the source encoding is not recognized on the user's server or the conversion fails for some reason, appropriate messages will be placed in the returned @errors array.

_marc_to_utf8_replacement_char

  _marc_to_utf8_replacement_char($marc_record, $marc_flavour);

Convert a MARC::Record to UTF-8 in-place, adopting the unsatisfactory method of replacing every non-ASCII octet (i.e., one where the eighth bit is set) with the Unicode replacement character. This is meant as a last-ditch method, and would be best used as part of a UI that lets a cataloguer try various character conversions until they find the right one.

char_decode5426

  my $utf8string = char_decode5426($iso_5426_string);

Converts a string from ISO-5426 to UTF-8.

AUTHOR

Koha Development Team <http://koha-community.org/>

Galen Charlton <galen.charlton@liblime.com>