grovel/src/ctype, branch main

Feed: http://euandre.org/git/grovel/atom?h=main
Updated: 2024-01-05T08:47:09Z

Setup stub unit test infrastructure
Author: EuAndreh <eu@euandre.org>
Updated: 2024-01-05T08:47:09Z
Published: 2024-01-04T23:36:02Z
ID: urn:sha1:8492f115890d56c98c1da24b9fdf26bb1b714c05

fix wcwidth of hangul combining (vowel/final) letters
Author: Rich Felker <dalias@aerifal.cx>
Updated: 2021-12-28T01:08:31Z
Published: 2021-12-28T01:08:31Z
ID: urn:sha1:775bde6b5c04ecc689ecbb4a25ceaf2ed6ab60c8

these characters combine onto a base character (initial) and therefore need to have width 0. the original binary-search implementation of wcwidth handled them correctly, but a regression was introduced in commit 1b0ce9af6d2aa7b92edaf3e9c631cb635bae22bd by generating the new tables from unicode without noticing that the classification logic in use (unicode character category Mn/Me/Cf) was insufficient to catch these characters.

fix wcwidth wrongly returning 0 for most of planes 4 and up
Author: Rich Felker <dalias@aerifal.cx>
Updated: 2020-01-02T01:02:51Z
Published: 2020-01-02T01:02:51Z
ID: urn:sha1:70d80609558153a996833392999c69cdb74e1119

commit 1b0ce9af6d2aa7b92edaf3e9c631cb635bae22bd introduced this bug back in 2012 and it was never noticed, presumably since the affected planes are essentially unused in Unicode.

update case mappings to unicode 12.1.0
Author: Rich Felker <dalias@aerifal.cx>
Updated: 2019-10-25T17:44:08Z
Published: 2019-10-25T17:44:08Z
ID: urn:sha1:06d4075a50b84d4b80e09eb5662fc1153bd559f7

update ctype data to unicode 12.1.0
Author: u_quark <el01049@gmail.com>
Updated: 2019-10-25T17:41:17Z
Published: 2019-10-12T21:27:42Z
ID: urn:sha1:e95538fa07d2b460b25ee6c2fef05f820888776d

overhaul wide character case mapping implementation
Author: Rich Felker <dalias@aerifal.cx>
Updated: 2019-10-25T16:39:40Z
Published: 2019-10-25T16:33:17Z
ID: urn:sha1:a11a6246c64b186c5547b9527a768a56f3d6b281

the existing implementation of case mappings was very small (typically around 1.5k), but unmaintainable, requiring manual addition of new case mappings with each new edition of Unicode.
often, it turned out that newly-added case mappings were not easily representable in the existing tightly-constrained table structures, requiring new hacks to be invented and delaying support for new characters.

the new implementation added here follows the pattern used for character class membership, with a two-level table allowing Unicode blocks for which no data is needed to be elided. however, rather than single-bit data, each character maps to one of up to 6 case-mapping rules available to its block, where 6 is floor(cbrt(256)) and allows 3 characters to be represented per byte (vs 8 with bit tables). blocks that would need more than 6 rules designate one as an exception and let lookup pass into a binary search of exceptional cases for the block.

the number 6 was chosen empirically; many blocks would be ok with 4 rules (uncased, lower, upper, possible exceptions), some even just with 2, but the latter are rare and fitting 4 characters per byte rather than 3 does not save significant space. moreover, somewhat surprisingly, there are sufficiently many blocks where even 4 rules don't suffice without a lot of exceptions (blocks where some case pairs are laced, others offset) that originally I was looking at supporting variable-width tables, with 1-, 2-, or 3-bit entries, thereby allowing blocks with 8 rules. as implemented in my experiments, that version was significantly larger and involved more memory accesses/cache lines.

improvements in size at the expense of some performance might be possible by utilizing iswalpha data or merging the table of case mapping identity with alphabetic identity. these were explored somewhat when the code was first written, and might be worth revisiting in the future.

add missing case mapping between U+03F3 and U+037F
Author: Rich Felker <dalias@aerifal.cx>
Updated: 2019-10-25T16:23:05Z
Published: 2019-10-25T16:20:22Z
ID: urn:sha1:e8aba58ab19a18f83d7f78e80d5e4f51e7e4e8a9

somehow this seems to have been overlooked.
add it now so that subsequent overhaul of case mapping implementation will not introduce a functional change at the same time.

reduce spurious inclusion of libc.h
Author: Rich Felker <dalias@aerifal.cx>
Updated: 2018-09-12T18:34:37Z
Published: 2018-09-12T04:08:09Z
ID: urn:sha1:5ce3737931bb411a8d167356d4d0287b53b0cbdc

libc.h was intended to be a header for access to global libc state and related interfaces, but ended up included all over the place because it was the way to get the weak_alias macro. most of the inclusions removed here are places where weak_alias was needed. a few were recently introduced for hidden. some go all the way back to when libc.h defined CANCELPT_BEGIN and _END, and all (wrongly implemented) cancellation points had to include it.

remaining spurious users are mostly callers of the LOCK/UNLOCK macros and files that use the LFS64 macro to define the awful *64 aliases.

in a few places, new inclusion of libc.h is added because several internal headers no longer implicitly include libc.h.

declarations for __lockfile and __unlockfile are moved from libc.h to stdio_impl.h so that the latter does not need libc.h. putting them in libc.h made no sense at all, since the macros in stdio_impl.h are needed to use them correctly anyway.

update case mappings to unicode 10.0
Author: Rich Felker <dalias@aerifal.cx>
Updated: 2017-12-19T00:34:21Z
Published: 2017-12-19T00:33:56Z
ID: urn:sha1:54941eddfd9cf2b40e489258e2fbf4bd1c90311e

the mapping tables and code are not automatically generated; they were produced by comparing the output of towupper/towlower against the mappings in the UCD, ignoring characters that were previously excluded from case mappings or from alphabetic status (micro sign and circled letters), and adding table entries or code for everything else missing.

based very loosely on a patch by Reini Urban.

update ctype tables to unicode 10.0
Author: Rich Felker <dalias@aerifal.cx>
Updated: 2017-12-18T23:05:23Z
Published: 2017-12-18T23:05:23Z
ID: urn:sha1:c72c1c52bc08aa0c41654bd0a38f6c951634e088
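
Note on the "overhaul wide character case mapping implementation" entry: the three-per-byte packing it describes works because 6^3 = 216 fits in one byte. The sketch below illustrates only that arithmetic and is not the code from the commit. The names pack3, unpack and rule_for are hypothetical, and the two-level block table and the binary search of exceptions are left out.

```c
/* Illustrative sketch only; names and layout are assumptions,
 * not taken from the commit described above. */

/* Pack three rule indices (each in 0..5) into one byte as a
 * base-6 number: r0 + 6*r1 + 36*r2, at most 5 + 30 + 180 = 215. */
static unsigned char pack3(unsigned r0, unsigned r1, unsigned r2)
{
    return (unsigned char)(r0 + 6 * r1 + 36 * r2);
}

/* Extract the base-6 digit at position pos (0..2) from a packed byte. */
static unsigned unpack(unsigned char byte, unsigned pos)
{
    static const unsigned divisor[3] = { 1, 6, 36 };
    return byte / divisor[pos] % 6;
}

/* Rule index for the character at offset c within a block whose
 * rules are stored three per byte in tab. */
static unsigned rule_for(const unsigned char *tab, unsigned c)
{
    return unpack(tab[c / 3], c % 3);
}
```

With 4 rules per block the same scheme would use 2 bits per character (4 per byte); the entry reports that this did not save significant space compared with 3 per byte.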