| Commit message | Author | Age | Files | Lines |
| * | fix wcwidth wrongly returning 0 for most of planes 4 and up•••commit 1b0ce9af6d2aa7b92edaf3e9c631cb635bae22bd introduced this bug
back in 2012 and it was never noticed, presumably since the affected
planes are essentially unused in Unicode.
| Rich Felker | 2020-01-01 | 1 | -1/+1 |
| * | update case mappings to unicode 12.1.0 | Rich Felker | 2019-10-25 | 1 | -85/+92 |
| * | update ctype data to unicode 12.1.0 | u_quark | 2019-10-25 | 4 | -201/+232 |
| * | overhaul wide character case mapping implementation•••the existing implementation of case mappings was very small (typically
around 1.5k), but unmaintainable, requiring manual addition of new
case mappings with each new edition of Unicode. often, it turned out
that newly-added case mappings were not easily representable in the
existing tightly-constrained table structures, requiring new hacks to
be invented and delaying support for new characters.
the new implementation added here follows the pattern used for
character class membership, with a two-level table allowing Unicode
blocks for which no data is needed to be elided. however, rather than
single-bit data, each character maps to one of up to 6 case-mapping
rules available to its block, where 6 is floor(cbrt(256)) and allows 3
characters to be represented per byte (vs 8 with bit tables). blocks
that would need more than 6 rules designate one as an exception and
let lookup pass into a binary search of exceptional cases for the
block.
the number 6 was chosen empirically; many blocks would be ok with 4
rules (uncased, lower, upper, possible exceptions), some even just
with 2, but the latter are rare and fitting 4 characters per byte
rather than 3 does not save significant space. moreover, somewhat
surprisingly, there are sufficiently many blocks where even 4 rules
don't suffice without a lot of exceptions (blocks where some case
pairs are laced, others offset) that originally I was looking at
supporting variable-width tables, with 1-, 2-, or 3-bit entries,
thereby allowing blocks with 8 rules. as implemented in my
experiments, that version was significantly larger and involved more
memory accesses/cache lines.
improvements in size at the expense of some performance might be
possible by utilizing iswalpha data or merging the table of case
mapping identity with alphabetic identity. these were explored
somewhat when the code was first written, and might be worth
revisiting in the future.
| Rich Felker | 2019-10-25 | 2 | -285/+340 |
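The "3 characters per byte" packing described in the commit above can be illustrated with a minimal sketch. It assumes a plain base-6 encoding of three rule indexes per byte; the names `pack3` and `rule_at` are hypothetical and are not taken from the musl source, and the real table layout may differ.

```c
#include <stddef.h>

/* Pack three rule indexes (each in 0..5) into one byte, base 6.
 * The largest value is 5 + 6*5 + 36*5 = 215, which fits because
 * 6*6*6 = 216 <= 256. (Hypothetical helper, not musl code.) */
static unsigned char pack3(unsigned a, unsigned b, unsigned c)
{
    return (unsigned char)(a + 6*b + 36*c);
}

/* Fetch the rule index for character position i in a packed table:
 * select the byte, divide away the lower base-6 digits, keep one digit. */
static unsigned rule_at(const unsigned char *tab, size_t i)
{
    static const unsigned char div[3] = { 1, 6, 36 };
    return tab[i / 3] / div[i % 3] % 6;
}
```

With 4 rules the same scheme would use 2 bits per entry and fit 4 characters per byte, which is the alternative the commit message weighs against 6 rules.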
| * | add missing case mapping between U+03F3 and U+037F•••somehow this seems to have been overlooked. add it now so that
subsequent overhaul of case mapping implementation will not introduce
a functional change at the same time.
| Rich Felker | 2019-10-25 | 1 | -0/+1 |
| * | reduce spurious inclusion of libc.h•••libc.h was intended to be a header for access to global libc state and
related interfaces, but ended up included all over the place because
it was the way to get the weak_alias macro. most of the inclusions
removed here are places where weak_alias was needed. a few were
recently introduced for hidden. some go all the way back to when
libc.h defined CANCELPT_BEGIN and _END, and all (wrongly implemented)
cancellation points had to include it.
remaining spurious users are mostly callers of the LOCK/UNLOCK macros
and files that use the LFS64 macro to define the awful *64 aliases.
in a few places, new inclusion of libc.h is added because several
internal headers no longer implicitly include libc.h.
declarations for __lockfile and __unlockfile are moved from libc.h to
stdio_impl.h so that the latter does not need libc.h. putting them in
libc.h made no sense at all, since the macros in stdio_impl.h are
needed to use them correctly anyway.
| Rich Felker | 2018-09-12 | 29 | -29/+1 |
| * | update case mappings to unicode 10.0•••the mapping tables and code are not automatically generated; they were
produced by comparing the output of towupper/towlower against the
mappings in the UCD, ignoring characters that were previously excluded
from case mappings or from alphabetic status (micro sign and circled
letters), and adding table entries or code for everything else
missing.
based very loosely on a patch by Reini Urban.
| Rich Felker | 2017-12-18 | 1 | -2/+41 |
| * | update ctype tables to unicode 10.0 | Rich Felker | 2017-12-18 | 4 | -220/+305 |
| * | reformat ctype tables to be diff-friendly, match tool output•••the new version of the code used to generate these tables forces a
newline every 256 entries, whereas at the time these files were
originally generated and committed, it only wrapped them at 80
columns. the new behavior ensures that localized changes to the
tables, if they are ever needed, will produce localized diffs.
commit d060edf6c569ba9df4b52d6bcd93edde812869c9 made the corresponding
changes to the iconv tables.
| Rich Felker | 2017-12-18 | 4 | -263/+276 |
| * | towupper/towlower: fast path for ascii chars•••Make a fast path for ascii chars which is assumed to be the most common
case. This has significant performance benefit on xml json and similar
| Natanael Copa | 2017-05-31 | 1 | -3/+3 |
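A fast path of this kind might look like the following sketch. It is an illustration only: `sketch_towupper` is a hypothetical name, and the fallback simply calls the standard `towupper` rather than whatever internal routine the actual patch uses.

```c
#include <wctype.h>

/* ASCII fast path: values below 128 are handled with a range check and
 * a subtraction, so the general (table-driven) mapping is only reached
 * for non-ASCII input. */
static wint_t sketch_towupper(wint_t wc)
{
    if ((unsigned)wc < 128) {
        /* (unsigned)wc - 'a' < 26 is true exactly for 'a'..'z' */
        if ((unsigned)wc - 'a' < 26)
            return wc - ('a' - 'A');
        return wc;
    }
    return towupper(wc); /* slow path, standard library */
}
```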
| * | byte-based C locale, phase 1: multibyte character handling functions•••this patch makes the functions which work directly on multibyte
characters treat the high bytes as individual abstract code units
rather than as multibyte sequences when MB_CUR_MAX is 1. since
MB_CUR_MAX is presently defined as a constant 4, all of the new code
added is dead code, and optimizing compilers' code generation should
not be affected at all. a future commit will activate the new code.
as abstract code units, bytes 0x80 to 0xff are represented by wchar_t
values 0xdf80 to 0xdfff, at the end of the surrogates range. this
ensures that they will never be misinterpreted as Unicode characters,
and that all wctype functions return false for these "characters"
without needing locale-specific logic. a high range outside of Unicode
such as 0x7fffff80 to 0x7fffffff was also considered, but since C11's
char16_t also needs to be able to represent conversions of these
bytes, the surrogate range was the natural choice.
| Rich Felker | 2015-06-16 | 1 | -2/+3 |
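The byte-to-code-unit mapping described above (bytes 0x80 to 0xff represented as 0xdf80 to 0xdfff) can be sketched as follows. The helper names are hypothetical and chosen for illustration; they are not claimed to match the macros or functions in the patch.

```c
#include <wchar.h>

/* Map a byte to its wchar_t representation in a byte-based C locale:
 * ASCII maps to itself, high bytes land at the end of the surrogate
 * range (0xdf80..0xdfff). */
static wchar_t byte_to_codeunit(unsigned char b)
{
    return b < 0x80 ? (wchar_t)b : (wchar_t)(0xdf00 | b);
}

/* True if wc is one of the 128 abstract code units. The unsigned
 * subtraction folds both range bounds into a single comparison. */
static int is_codeunit(wchar_t wc)
{
    return (unsigned long)wc - 0xdf80UL < 0x80UL;
}

/* Recover the original byte: the low 8 bits of the code unit. */
static unsigned char codeunit_to_byte(wchar_t wc)
{
    return (unsigned char)wc;
}
```

Because the whole range sits inside the surrogates, no valid Unicode character can collide with it, which is the property the commit message relies on.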
| * | add macro version of ctype.h isascii function•••presumably internal code (ungetwc and fputwc) was written assuming a
macro implementation existed; otherwise use of isascii is just a
pessimization.
| Rich Felker | 2015-06-06 | 1 | -0/+1 |
| * | fix case mapping for U+00DF (ß)•••U+00DF ('ß') has had an uppercase form (U+1E9E) available since
Unicode 5.1, but Unicode lacks the case mappings for it due to
stability policy. when I added support for the new character in commit
1a63a9fc30e7a1f1239e3cedcb5041e5ec1c5351, I omitted the mapping in the
lowercase-to-uppercase direction. this choice was not based on any
actual information, only assumptions.
this commit adds bidirectional case mappings between U+00DF and
U+1E9E, and removes the special-case hack that allowed U+00DF to be
identified as lowercase despite lacking a mapping. aside from strong
evidence that this is the "right" behavior for real-world usage of
these characters, several factors informed this decision:
- the other "potentially correct" mapping, to "SS", is not
representable in the C case-mapping system anyway.
- leaving one letter in lowercase form when transforming a string to
uppercase is obviously wrong.
- having a character which is nominally lowercase but which is fixed
under case mapping violates reasonable invariants.
| Rich Felker | 2014-09-05 | 2 | -2/+1 |
| * | add inline isspace in ctype.h as an optimization•••isspace can be a bottleneck in a simple parser, inlining it
gives slightly smaller and faster code
src/locale/pleval.o already had this optimization, the size
change for other libc functions for i386 is
src/internal/intscan.o 2134 2118 -16
src/locale/dcngettext.o 1562 1552 -10
src/network/res_msend.o 1961 1940 -21
src/network/lookup_name.o 2627 2608 -19
src/network/getnameinfo.o 1814 1811 -3
src/network/lookup_serv.o 643 624 -19
src/stdio/vfscanf.o 2675 2663 -12
src/stdlib/atoll.o 117 107 -10
src/stdlib/atoi.o 95 91 -4
src/stdlib/atol.o 95 91 -4
src/time/strptime.o 1515 1503 -12
(TOTALS) 432451 432321 -130
| Szabolcs Nagy | 2014-08-13 | 1 | -0/+1 |
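An inline `isspace` for the C locale can be written with a single range check, as in the sketch below. The name `sketch_isspace` is hypothetical, and the code assumes an ASCII-compatible execution character set in which `'\t'`, `'\n'`, `'\v'`, `'\f'` and `'\r'` are the contiguous values 9 to 13.

```c
/* Inline whitespace test: space, or one of the five control characters
 * '\t'..'\r'. The unsigned subtraction makes negative inputs such as
 * EOF wrap to a large value, so they fail the comparison. */
static inline int sketch_isspace(int c)
{
    return c == ' ' || (unsigned)c - '\t' < 5;
}
```

Since the function body is two comparisons, a compiler can inline it at each call site, which is consistent with the small per-object size reductions listed in the commit message.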
| * | consolidate *_l ctype/wctype functions into their non-_l source files•••the main practical purposes of this commit are to remove a huge amount
of clutter from the src/locale directory, to cut down on the length of
the $(AR) and $(LD) command lines, and to reduce the amount of space
wasted by object file headers in the static libc.a. build time may
also be reduced, though this has not been measured.
as an additional justification, if there ever were a need for the
behavior of these functions to vary by locale, it would be necessary
for the non-_l versions to call the _l versions, so that linking the
former without the latter would not be possible anyway.
| Rich Felker | 2014-07-02 | 29 | -9/+256 |
| * | include cleanups: remove unused headers and add feature test macros | Szabolcs Nagy | 2013-12-12 | 4 | -6/+3 |
| * | iswspace: fix handling of 0 | rofl0r | 2013-11-11 | 1 | -2/+1 |
| * | fix types for wctype_t and wctrans_t•••wctype_t was incorrectly "int" rather than "long" on x86_64. not only
is this an ABI incompatibility; it's also a major design flaw if we
ever wanted wctype_t to be implemented as a pointer, which would be
necessary if locales support custom character classes, since int is
too small to store a converted pointer. this commit fixes wctype_t to
be unsigned long on all archs, matching the LSB ABI; this change does
not matter for C code, but for C++ it affects mangling.
the same issue applied to wctrans_t. glibc/LSB defines this type as
const __int32_t *, but since no such definition is visible, I've just
expanded the definition, int, everywhere.
it would be nice if these types (which don't vary by arch) could be in
wctype.h, but the OB XSI requirement in POSIX that wchar.h expose some
types and functions from wctype.h precludes doing so. glibc works
around this with some hideous hacks, but trying to duplicate that
would go against the intent of musl's headers.
| Rich Felker | 2013-03-04 | 1 | -4/+4 |
| * | make some arrays const•••this way they'll go into .rodata, decreasing memory pressure.
| rofl0r | 2013-02-02 | 3 | -4/+4 |
| * | fix argument type error on wcwidth function•••since the correct declaration was not visible, and since the
representation of the types wchar_t and wint_t always match, a
compiler would have to go out of its way to make this bug manifest,
but better to fix it anyway.
| Rich Felker | 2012-08-02 | 1 | -2/+2 |
| * | fix broken wcwidth tables•••unicode char data has both "W" and "F" wide types and the old table
only included the "W" ones. this omitted U+3000 (ideographic space)
and all the wide-ascii, etc.
| Rich Felker | 2012-06-20 | 1 | -7/+8 |
| * | fix ctype abi junk (pointer should point to 0 slot, not -128 slot) | Rich Felker | 2012-06-05 | 3 | -3/+3 |
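The "0 slot versus -128 slot" distinction can be shown with a small sketch. It assumes the glibc-style convention of a 384-entry table indexed from -128 to 255, where the exported pointer must address the entry for character 0. All names and the single flag bit here are hypothetical.

```c
#include <stddef.h>

enum { SKETCH_DIGIT = 1 }; /* illustrative class bit */

/* 384 entries: indexes -128..-1 (signed char values and EOF),
 * then 0..255 (unsigned char values). */
static unsigned short sketch_table[384];

/* The exported pointer addresses the slot for character 0, so that
 * ptr[c] is valid for every c in -128..255. Pointing it at
 * sketch_table itself (the -128 slot) would shift every lookup. */
static const unsigned short *sketch_ptr = sketch_table + 128;

/* Fill in the digit class for '0'..'9'. */
static void sketch_init(void)
{
    for (int c = '0'; c <= '9'; c++)
        sketch_table[128 + c] = SKETCH_DIGIT;
}

/* Lookup in the style a caller compiled against this ABI would use. */
static int sketch_isdigit(int c)
{
    return sketch_ptr[c] & SKETCH_DIGIT;
}
```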
| * | add LSB abi junk for ctype functions•••this should be the last major fix needed to support running
glibc-linked conforming POSIX programs with musl in place of glibc, as
long as musl provides the features they need and they don't use
pthread cancellation (which is implemented as c++ exceptions in glibc,
and fundamentally incompatible with musl).
| Rich Felker | 2012-06-02 | 3 | -0/+104 |
| * | new wcwidth implementation (fast table-based)•••i tried to go with improving the old binary-search-based algorithm,
but between growth in the number of ranges, bad performance, and lack
of confidence in the binary search code's stability under changes in
the table, i decided it was worth the extra 1.8k to have something
clean and maintainable.
also note that, like the alpha and punct tables, there's definitely
room to optimize the nonspacing/wide tables by overlapping subtables.
this is not a high priority, but i've begun looking into how to do it,
and i suspect the table sizes can be roughly halved. if that turns out
to be true, the new, fast, table-based implementation will be roughly
the same size as if i had just extended the old binary search one.
| Rich Felker | 2012-04-24 | 3 | -179/+125 |
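One common shape for a fast table-based membership test is a two-level bit table, where blocks with no members share a single all-zero subtable. The sketch below is an assumption about the general technique, with made-up data covering only U+0000 to U+03FF; it is not the table or code from this commit.

```c
/* First level: one entry per 256-codepoint block, giving the index of
 * the second-level subtable to use. Blocks 0..2 share the empty
 * subtable 0. (Illustrative data only.) */
static const unsigned char sketch_top[4] = { 0, 0, 0, 1 };

/* Second level: 256 bits (32 bytes) per subtable. Subtable 1 marks
 * U+0300..U+0307 as members; everything else is clear. */
static const unsigned char sketch_blocks[2][32] = {
    { 0 },
    { 0xff },
};

/* Two memory accesses and a shift: pick the subtable by the high bits,
 * the byte by bits 3..7, and the bit by bits 0..2. */
static int sketch_in_set(unsigned wc)
{
    if (wc >= 0x400)
        return 0; /* outside the range this toy table covers */
    return sketch_blocks[sketch_top[wc >> 8]][(wc & 255) >> 3] >> (wc & 7) & 1;
}
```

Sharing subtables between blocks is what keeps such a table compact; the commit message's remark about overlapping subtables points at a further refinement of the same idea.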
| * | sync case mappings with unicode 6.1•••also special-case ß (U+00DF) as lowercase even though it does not have
a mapping to uppercase. unicode added an uppercase version of this
character but does not map it, presumably because the uppercase
version is not actually used except for some obscure purpose...
| Rich Felker | 2012-04-23 | 2 | -8/+30 |
| * | optimize iswprint | Rich Felker | 2012-04-23 | 1 | -3/+12 |
| * | fix spurious punct class for some surrogate codepoints (invalid)•••this happened due to their entries in UnicodeData.txt
| Rich Felker | 2012-04-23 | 1 | -59/+56 |
| * | destubify iswalpha and update iswpunct to unicode 6.1•••alpha is defined as unicode property "Alphabetic" plus category Nd
minus ASCII digits minus 2 special-cased Thai punctuation marks
supposedly misclassified by Unicode as letters.
punct is defined as all of unicode except control, alphanumeric, and
space characters.
the tables were generated by a simple tool based on the code posted
previously to the mailing list. in the future, this and other code
used for maintaining locale/iconv/i18n data will be published either
in the main source repository or in a separate locale data generation
repository.
| Rich Felker | 2012-04-23 | 5 | -135/+252 |
| * | document iswspace and remove wrongly-included zwsp character | Rich Felker | 2012-02-09 | 1 | -1/+5 |
| * | fix typo in iswspace space list table | Rich Felker | 2012-02-09 | 1 | -1/+1 |
| * | more header fixes, minor warning fix | Rich Felker | 2011-02-14 | 1 | -0/+1 |
| * | initial check-in, version 0.5.0 | Rich Felker | 2011-02-12 | 34 | -0/+854 |