utf_ranges

A collection of Unicode utilities for C++ using Range-V3

This header-only library contains facilities for transforming between UTF-8, UTF-16 and UTF-32 encoded strings (eagerly and lazily), as well as dealing with byte-order marks and transforming line endings.

Example

A quick overview is best supplied by an example. The following reads a UTF-8 encoded input stream and outputs a UTF-16BE byte stream with byte-order mark:

    namespace rng = ::ranges::v3;
    namespace utf = ::tcb::utf_ranges;

    std::ifstream in_file{"input_file.utf8.txt", std::ios::binary};
    std::ofstream out_file{"output_file.utf16be.txt", std::ios::binary};

    auto view = utf::istreambuf(in_file) // Read range from input stream
            | utf::view::consume_bom     // Remove UTF-8 "BOM" if present
            | utf::view::utf16           // Convert to UTF-16
            | utf::view::add_bom         // Prepend UTF-16 BOM to start of range
            | utf::view::endian_convert<boost::endian::order::big> // Convert to big-endian
            | utf::view::bytes;          // Write to disk as bytes

    rng::copy(view, utf::ostreambuf_iterator<char>{out_file}); // Do the copy

(see example/utf8_to_utf16be.cpp for the full code).
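To make the end result of that pipeline concrete, here is a rough standalone sketch of its last three stages (add a BOM, convert to big-endian, emit bytes) using only the standard library. It does not use utf_ranges or Range-V3, and the function names `utf16be_bytes_with_bom` and `example_utf16be_bytes` are hypothetical, chosen for illustration only.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Serialise UTF-16 code units as big-endian bytes, preceded by a BOM.
// Hypothetical helper: a sketch of the idea, not the library's code.
std::vector<unsigned char> utf16be_bytes_with_bom(const std::u16string& in)
{
    std::vector<unsigned char> out;
    out.reserve(2 * (in.size() + 1));

    // Big-endian means the high byte of each code unit is written first.
    const auto put = [&out](char16_t u) {
        out.push_back(static_cast<unsigned char>(u >> 8));
        out.push_back(static_cast<unsigned char>(u & 0xFF));
    };

    put(char16_t{0xFEFF});          // byte-order mark
    for (char16_t u : in) put(u);   // payload
    return out;
}

// Usage: "Hi" becomes FE FF 00 48 00 69.
void example_utf16be_bytes()
{
    const std::vector<unsigned char> expected{0xFE, 0xFF, 0x00, 0x48, 0x00, 0x69};
    assert(utf16be_bytes_with_bom(u"Hi") == expected);
}
```

Unlike the lazy view pipeline above, this sketch builds the whole byte vector eagerly, which is simpler to read but needs memory proportional to the input.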

Conversions

For "eager" encoding conversions, the library broadly follows the API specified in Beman Dawes' proposed Unicode conversion library, albeit (currently) with simplified error handling (invalid Unicode characters are simply replaced by the Unicode replacement character U+FFFD). The actual conversion uses code taken from Boost.Locale.
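The replacement policy can be illustrated with a small standalone decoder. This is only a sketch under stated assumptions: it is not the code taken from Boost.Locale, the name `decode_utf8_lossy` is hypothetical, and the choice to emit one U+FFFD per offending byte is an assumption that may differ from what the library does for a given malformed sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-8 to UTF-32, replacing each invalid byte with U+FFFD.
// Hypothetical helper for illustration; not part of utf_ranges.
std::u32string decode_utf8_lossy(const std::string& in)
{
    constexpr char32_t replacement = 0xFFFD;
    // Smallest code point that legitimately needs a sequence of each length.
    static constexpr char32_t min_for_len[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;   // 0 means "not a valid lead byte"
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;       // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates and values beyond U+10FFFF.
        if (ok && (cp < min_for_len[len] || cp > 0x10FFFF ||
                   (cp >= 0xD800 && cp <= 0xDFFF)))
            ok = false;

        if (ok) { out.push_back(cp); i += len; }
        else    { out.push_back(replacement); i += 1; }
    }
    return out;
}

// Usage: valid input decodes normally, a stray 0xFF byte becomes U+FFFD.
void example_decode_utf8_lossy()
{
    assert(decode_utf8_lossy("caf\xC3\xA9") == U"caf\u00E9");
    assert(decode_utf8_lossy("a\xFFz") == U"a\uFFFDz");
}
```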

To convert a range of characters between UTF-8, UTF-16 or UTF-32, use the tcb::utf_ranges::utf_convert() function. This takes an InputRange with a value type that is an arithmetic type of size 1, 2 or 4 bytes (for UTF-8, UTF-16 and UTF-32 respectively), and an OutputIterator with a value type similarly defined. For example:

    std::string in = u8"Hello world";
    std::u16string out;
    // Note that the output type cannot be determined automatically, so must be specified
    tcb::utf_ranges::utf_convert<char16_t>(in, std::back_inserter(out));

To transform directly to a new string, the to_utf_string() function is supplied:

    std::u16string in = u"Hello world";
    std::string out = tcb::utf_ranges::to_utf_string<char>(in);

Convenience functions to_u8string(), to_u16string(), to_u32string() and to_wstring() are also provided (but please don't use the last one):

    std::u32string in = U"Hello world";
    std::u16string out = tcb::utf_ranges::to_u16string(in);
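For readers curious about what a UTF-32 to UTF-16 conversion involves, the following standalone sketch shows the surrogate-pair arithmetic. It is not the library's implementation; `encode_utf16` is a hypothetical name, and replacing invalid code points with U+FFFD is assumed here to mirror the error handling described earlier.

```cpp
#include <cassert>
#include <string>

// Encode UTF-32 code points as UTF-16 code units.
// Hypothetical helper for illustration; not part of utf_ranges.
std::u16string encode_utf16(const std::u32string& in)
{
    std::u16string out;
    for (char32_t cp : in) {
        // Surrogates and out-of-range values are not valid code points.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;

        if (cp < 0x10000) {
            // Basic Multilingual Plane: one code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary planes: split 20 bits across a surrogate pair.
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return out;
}

// Usage: U+1F600 needs the surrogate pair D83D DE00.
void example_encode_utf16()
{
    assert(encode_utf16(U"Hello world") == u"Hello world");
    const std::u16string pair{char16_t{0xD83D}, char16_t{0xDE00}};
    assert(encode_utf16(std::u32string(1, U'\U0001F600')) == pair);
}
```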

Views

If you're familiar with Range-V3, you'll know that views perform lazy transformations on a given range -- that is, conversion is done one element at a time when the view is iterated over.

Encoding conversions

This library provides views which lazily perform the same transformations as above. For consistency with Range-V3, these are in the view:: sub-namespace.

    std::u16string in = u"Hello world";

    auto view = tcb::utf_ranges::view::utf8(in);

    ranges::v3::copy(view, std::ostream_iterator<char>(std::cout));

There are similar utf16 and utf32 views.
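The idea of lazy conversion can be shown without Range-V3. The sketch below decodes UTF-16 to code points one at a time, only when asked, and never materialises a converted string. The class `lazy_code_points` and its members are hypothetical and far simpler than a real view (no iterator interface, no `operator|`); it only illustrates the "convert on demand" behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Pulls code points out of a UTF-16 string on demand.
// Hypothetical class for illustration; the source string must outlive it.
class lazy_code_points {
public:
    explicit lazy_code_points(const std::u16string& s) : src_(&s) {}

    bool done() const { return pos_ >= src_->size(); }

    // Decode exactly one code point and advance past it.
    char32_t next()
    {
        assert(!done());
        const char16_t u = (*src_)[pos_++];
        if (u >= 0xD800 && u <= 0xDBFF && !done()) {
            const char16_t lo = (*src_)[pos_];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                ++pos_;
                return 0x10000 + ((char32_t(u) - 0xD800) << 10)
                               + (char32_t(lo) - 0xDC00);
            }
        }
        if (u >= 0xD800 && u <= 0xDFFF) return 0xFFFD;  // unpaired surrogate
        return u;
    }

    // How many UTF-16 code units have been read so far.
    std::size_t units_consumed() const { return pos_; }

private:
    const std::u16string* src_;
    std::size_t pos_ = 0;
};

// Usage: nothing is decoded until next() is called.
void example_lazy_code_points()
{
    const std::u16string s = u"A\U0001F600";
    lazy_code_points cps(s);
    assert(cps.units_consumed() == 0);
    assert(cps.next() == U'A');
    assert(cps.units_consumed() == 1);
    assert(cps.next() == U'\U0001F600');
    assert(cps.done());
}
```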

Endian transformations

For UTF-16 and UTF-32, the library provides views which perform byte-swapping between native-, big- and little-endian representations, using code from Boost.Endian. The output endianness is specified by a template parameter, and the input endianness is passed as an argument to the constructor. Both default to boost::endian::order::native. For example:

    std::u16string in = u"Hello world"; // native endian
    auto view = tcb::utf_ranges::view::endian_convert<boost::endian::order::big>(in);
    std::vector<std::int16_t> out = view; // Copy byte-swapped values to vector
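The byte-swapping itself is simple enough to sketch by hand. The following standalone code does not use Boost.Endian or utf_ranges; `swap_bytes`, `host_is_little_endian` and `to_big_endian` are hypothetical names, and the runtime endianness probe is just one common way to detect the host byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Reverse the two bytes of a UTF-16 code unit.
constexpr char16_t swap_bytes(char16_t u)
{
    return static_cast<char16_t>(((u & 0x00FF) << 8) | ((u & 0xFF00) >> 8));
}

// Reverse the four bytes of a UTF-32 code unit.
constexpr char32_t swap_bytes(char32_t u)
{
    return ((u & 0x000000FFu) << 24) | ((u & 0x0000FF00u) << 8) |
           ((u & 0x00FF0000u) >> 8)  | ((u & 0xFF000000u) >> 24);
}

// True if the least significant byte is stored first on this machine.
bool host_is_little_endian()
{
    const std::uint16_t probe = 1;
    return *reinterpret_cast<const unsigned char*>(&probe) == 1;
}

// Convert native-endian UTF-16 to big-endian; a no-op on big-endian hosts.
std::u16string to_big_endian(std::u16string s)
{
    if (host_is_little_endian())
        for (char16_t& u : s) u = swap_bytes(u);
    return s;
}

// Usage: swapping twice restores the original value.
void example_swap_bytes()
{
    assert(swap_bytes(char16_t{0x1234}) == 0x3412);
    assert(swap_bytes(swap_bytes(char16_t{0xFEFF})) == 0xFEFF);
    assert(swap_bytes(char32_t{0x12345678}) == 0x78563412);
}
```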

Byte order mark handling

The library provides two views for dealing with "byte order marks", that is, the Unicode character U+FEFF (zero-width no-break space), which is often placed at the start of files to allow the endianness to be detected.

To detect a byte-order mark, use the consume_bom view:

    std::u16string in = u"\uFEFFHello world"; // native-endian UTF-16 with BOM
    auto view = tcb::utf_ranges::view::consume_bom(in);
    std::u16string out = view; // copy to new string with BOM removed

As suggested by the name, the byte order mark is removed if present. If a BOM is found and has non-native endianness, endian conversion is automatically performed -- that is, the output of the view will always be native-endian. For UTF-8, if a BOM is detected it is simply removed. If no BOM is present, the string is assumed to be native-endian (for UTF-16 and -32), and is passed through unchanged.
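The rules in the paragraph above can be sketched for UTF-16 as a plain function. This is an eager, standalone approximation rather than the lazy view itself, and `consume_bom_utf16` is a hypothetical name. It relies on the fact that a byte-swapped BOM reads as 0xFFFE, which is not a valid character and therefore signals non-native input.

```cpp
#include <cassert>
#include <string>

// Strip a leading BOM from UTF-16 and return native-endian code units.
// Hypothetical helper for illustration; not part of utf_ranges.
std::u16string consume_bom_utf16(std::u16string s)
{
    constexpr char16_t bom = 0xFEFF;          // BOM in native byte order
    constexpr char16_t swapped_bom = 0xFFFE;  // BOM seen with bytes reversed

    if (s.empty()) return s;

    if (s.front() == bom) {
        s.erase(0, 1);                        // native-endian: just drop it
    } else if (s.front() == swapped_bom) {
        s.erase(0, 1);                        // non-native: drop it and swap the rest
        for (char16_t& u : s)
            u = static_cast<char16_t>((u << 8) | (u >> 8));
    }
    // No BOM: assume native-endian and pass through unchanged.
    return s;
}

// Usage: all three inputs end up as native-endian "Hi".
void example_consume_bom_utf16()
{
    assert(consume_bom_utf16(u"\uFEFFHi") == u"Hi");
    assert(consume_bom_utf16(u"Hi") == u"Hi");
    const std::u16string swapped{char16_t{0xFFFE}, char16_t{0x4800}, char16_t{0x6900}};
    assert(consume_bom_utf16(swapped) == u"Hi");
}
```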

To place a byte-order mark at the start of a string, use the add_bom view:

    std::u16string in = u"Hello world";
    auto view = tcb::utf_ranges::view::add_bom(in);
    std::u16string out = view; // copy to new string, with BOM prepended

Line ending transformation

Unicode specifies eight possible line endings, and recommends that these be converted to the machine-native line ending representation on input. In C++, the native representation is "\n". The line_end_transform view performs such a conversion. For example:

    std::string in = u8"Hello world\r\n"; // Windows-style
    std::string out = tcb::utf_ranges::view::line_end_transform(in);
    assert(out == "Hello world\n");
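As a rough illustration of what such a transformation does, the sketch below normalises line endings on a UTF-32 string. It is not the library's view: `normalize_line_endings` is a hypothetical name, and the set of eight endings handled (CR, LF, CRLF, NEL, VT, FF, LS, PS) is taken from the Unicode newline guidelines on the assumption that the library treats the same set.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replace every Unicode line ending with a single '\n'.
// Hypothetical helper for illustration; not part of utf_ranges.
std::u32string normalize_line_endings(const std::u32string& in)
{
    std::u32string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t c = in[i];
        switch (c) {
        case U'\r':
            // CR LF is one line ending, so skip the LF that follows a CR.
            if (i + 1 < in.size() && in[i + 1] == U'\n') ++i;
            out.push_back(U'\n');
            break;
        case U'\n':    // LF
        case 0x000B:   // VT
        case 0x000C:   // FF
        case 0x0085:   // NEL
        case 0x2028:   // LS
        case 0x2029:   // PS
            out.push_back(U'\n');
            break;
        default:
            out.push_back(c);
        }
    }
    return out;
}

// Usage: Windows, classic Mac and Unicode separators all become '\n'.
void example_normalize_line_endings()
{
    assert(normalize_line_endings(U"Hello world\r\n") == U"Hello world\n");
    assert(normalize_line_endings(U"a\rb") == U"a\nb");
    assert(normalize_line_endings(U"a\u2028b\u2029") == U"a\nb\n");
}
```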

Chaining views

As with Range-V3, operator| is overloaded for views, allowing them to be easily chained together, as in the example at the top of this document.

Licence

This library is provided under the Boost licence. See LICENCE_1_0.txt for details.

Boost Software License - Version 1.0 - August 17th, 2003

Permission is hereby granted, free of charge, to any person or organization obtaining a copy of the software and accompanying documentation covered by this license (the "Software") to use, reproduce, display, distribute, execute, and transmit the Software, and to prepare derivative works of the Software, and to permit third-parties to whom the Software is furnished to do so, all subject to the following:

The copyright notices in the Software and this entire statement, including the above license grant, this restriction and the following disclaimer, must be included in all copies of the Software, in whole or in part, and all derivative works of the Software, unless such copies or derivative works are solely in the form of machine-executable object code generated by a source language processor.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
