<feed xmlns='http://www.w3.org/2005/Atom'>
<title>grovel/src/ctype, branch main</title>
<id>http://euandre.org/git/grovel/atom?h=main</id>
<link rel='self' href='http://euandre.org/git/grovel/atom?h=main'/>
<link rel='alternate' type='text/html' href='http://euandre.org/git/grovel/'/>
<updated>2024-01-05T08:47:09Z</updated>
<entry>
<title>Setup stub unit test infrastructure</title>
<updated>2024-01-05T08:47:09Z</updated>
<author>
<name>EuAndreh</name>
<email>eu@euandre.org</email>
</author>
<published>2024-01-04T23:36:02Z</published>
<link rel='alternate' type='text/html' href='http://euandre.org/git/grovel/commit/?id=8492f115890d56c98c1da24b9fdf26bb1b714c05'/>
<id>urn:sha1:8492f115890d56c98c1da24b9fdf26bb1b714c05</id>
<content type='text'>
</content>
</entry>
<entry>
<title>fix wcwidth of hangul combining (vowel/final) letters</title>
<updated>2021-12-28T01:08:31Z</updated>
<author>
<name>Rich Felker</name>
<email>dalias@aerifal.cx</email>
</author>
<published>2021-12-28T01:08:31Z</published>
<link rel='alternate' type='text/html' href='http://euandre.org/git/grovel/commit/?id=775bde6b5c04ecc689ecbb4a25ceaf2ed6ab60c8'/>
<id>urn:sha1:775bde6b5c04ecc689ecbb4a25ceaf2ed6ab60c8</id>
<content type='text'>
these characters combine onto a base character (initial) and therefore
need to have width 0. the original binary-search implementation of
wcwidth handled them correctly, but a regression was introduced in
commit 1b0ce9af6d2aa7b92edaf3e9c631cb635bae22bd by generating the new
tables from unicode without noticing that the classification logic in
use (unicode character category Mn/Me/Cf) was insufficient to catch
these characters.
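
a minimal sketch of why a category-only classification misses these
characters, using python's unicodedata module rather than the actual
table generator. the helper name category_based_zero_width is
hypothetical, and the sample code points are chosen for illustration
only.

```python
import unicodedata

# the general categories named in the commit message above
ZERO_WIDTH_CATEGORIES = {"Mn", "Me", "Cf"}

def category_based_zero_width(cp):
    """Return True if a category-only classifier would assign width 0."""
    return unicodedata.category(chr(cp)) in ZERO_WIDTH_CATEGORIES

# U+0301 COMBINING ACUTE ACCENT has category Mn, so it is caught.
# U+1161 (a hangul vowel) and U+11A8 (a hangul final) have category Lo,
# so a classifier based only on Mn/Me/Cf does not give them width 0.
samples = [0x0301, 0x1161, 0x11A8]
results = {hex(cp): category_based_zero_width(cp) for cp in samples}
```

the hangul vowel and final letters therefore need to be handled
separately from the category check.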
</content>
</entry>
<entry>
<title>fix wcwidth wrongly returning 0 for most of planes 4 and up</title>
<updated>2020-01-02T01:02:51Z</updated>
<author>
<name>Rich Felker</name>
<email>dalias@aerifal.cx</email>
</author>
<published>2020-01-02T01:02:51Z</published>
<link rel='alternate' type='text/html' href='http://euandre.org/git/grovel/commit/?id=70d80609558153a996833392999c69cdb74e1119'/>
<id>urn:sha1:70d80609558153a996833392999c69cdb74e1119</id>
<content type='text'>
commit 1b0ce9af6d2aa7b92edaf3e9c631cb635bae22bd introduced this bug
back in 2012 and it was never noticed, presumably since the affected
planes are essentially unused in Unicode.
</content>
</entry>
<entry>
<title>update case mappings to unicode 12.1.0</title>
<updated>2019-10-25T17:44:08Z</updated>
<author>
<name>Rich Felker</name>
<email>dalias@aerifal.cx</email>
</author>
<published>2019-10-25T17:44:08Z</published>
<link rel='alternate' type='text/html' href='http://euandre.org/git/grovel/commit/?id=06d4075a50b84d4b80e09eb5662fc1153bd559f7'/>
<id>urn:sha1:06d4075a50b84d4b80e09eb5662fc1153bd559f7</id>
<content type='text'>
</content>
</entry>
<entry>
<title>update ctype data to unicode 12.1.0</title>
<updated>2019-10-25T17:41:17Z</updated>
<author>
<name>u_quark</name>
<email>el01049@gmail.com</email>
</author>
<published>2019-10-12T21:27:42Z</published>
<link rel='alternate' type='text/html' href='http://euandre.org/git/grovel/commit/?id=e95538fa07d2b460b25ee6c2fef05f820888776d'/>
<id>urn:sha1:e95538fa07d2b460b25ee6c2fef05f820888776d</id>
<content type='text'>
</content>
</entry>
<entry>
<title>overhaul wide character case mapping implementation</title>
<updated>2019-10-25T16:39:40Z</updated>
<author>
<name>Rich Felker</name>
<email>dalias@aerifal.cx</email>
</author>
<published>2019-10-25T16:33:17Z</published>
<link rel='alternate' type='text/html' href='http://euandre.org/git/grovel/commit/?id=a11a6246c64b186c5547b9527a768a56f3d6b281'/>
<id>urn:sha1:a11a6246c64b186c5547b9527a768a56f3d6b281</id>
<content type='text'>
the existing implementation of case mappings was very small (typically
around 1.5k), but unmaintainable, requiring manual addition of new
case mappings with each new edition of Unicode. often, it turned out
that newly-added case mappings were not easily representable in the
existing tightly-constrained table structures, requiring new hacks to
be invented and delaying support for new characters.

the new implementation added here follows the pattern used for
character class membership, with a two-level table allowing Unicode
blocks for which no data is needed to be elided. however, rather than
single-bit data, each character maps to one of up to 6 case-mapping
rules available to its block, where 6 is floor(cbrt(256)) and allows 3
characters to be represented per byte (vs 8 with bit tables). blocks
that would need more than 6 rules designate one as an exception and
let lookup pass into a binary search of exceptional cases for the
block.
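
a minimal sketch of the packing arithmetic described above, written in
python rather than the c of the actual implementation. the names pack3
and unpack are hypothetical, and the digit order within a byte is an
assumption; the real tables may lay entries out differently.

```python
# 6 is the largest base whose cube fits in one byte:
# 6**3 == 216 fits in 256 values, while 7**3 == 343 does not.
RULES_PER_BLOCK = 6

def pack3(a, b, c):
    """Pack three rule indices (each 0..5) into one byte value."""
    for digit in (a, b, c):
        if digit not in range(RULES_PER_BLOCK):
            raise ValueError("rule index out of range")
    return a + RULES_PER_BLOCK * b + RULES_PER_BLOCK ** 2 * c

def unpack(byte, pos):
    """Return the rule index stored at position pos (0, 1, or 2)."""
    return byte // RULES_PER_BLOCK ** pos % RULES_PER_BLOCK

packed = pack3(1, 4, 2)
rules = [unpack(packed, pos) for pos in range(3)]
```

under these assumptions, looking up a character's rule costs one
division and one remainder on the packed byte, in place of the shift
and mask used with single-bit tables.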

the number 6 was chosen empirically; many blocks would be ok with 4
rules (uncased, lower, upper, possible exceptions), some even just
with 2, but the latter are rare and fitting 4 characters per byte
rather than 3 does not save significant space. moreover, somewhat
surprisingly, there are sufficiently many blocks where even 4 rules
don't suffice without a lot of exceptions (blocks where some case
pairs are laced, others offset) that originally I was looking at
supporting variable-width tables, with 1-, 2-, or 3-bit entries,
thereby allowing blocks with 8 rules. as implemented in my
experiments, that version was significantly larger and involved more
memory accesses/cache lines.

improvements in size at the expense of some performance might be
possible by utilizing iswalpha data or merging the table of case
mapping identity with alphabetic identity. these were explored
somewhat when the code was first written, and might be worth
revisiting in the future.
</content>
</entry>
<entry>
<title>add missing case mapping between U+03F3 and U+037F</title>
<updated>2019-10-25T16:23:05Z</updated>
<author>
<name>Rich Felker</name>
<email>dalias@aerifal.cx</email>
</author>
<published>2019-10-25T16:20:22Z</published>
<link rel='alternate' type='text/html' href='http://euandre.org/git/grovel/commit/?id=e8aba58ab19a18f83d7f78e80d5e4f51e7e4e8a9'/>
<id>urn:sha1:e8aba58ab19a18f83d7f78e80d5e4f51e7e4e8a9</id>
<content type='text'>
somehow this seems to have been overlooked. add it now so that the
subsequent overhaul of the case mapping implementation will not
introduce a functional change at the same time.
</content>
</entry>
<entry>
<title>reduce spurious inclusion of libc.h</title>
<updated>2018-09-12T18:34:37Z</updated>
<author>
<name>Rich Felker</name>
<email>dalias@aerifal.cx</email>
</author>
<published>2018-09-12T04:08:09Z</published>
<link rel='alternate' type='text/html' href='http://euandre.org/git/grovel/commit/?id=5ce3737931bb411a8d167356d4d0287b53b0cbdc'/>
<id>urn:sha1:5ce3737931bb411a8d167356d4d0287b53b0cbdc</id>
<content type='text'>
libc.h was intended to be a header for access to global libc state and
related interfaces, but ended up included all over the place because
it was the way to get the weak_alias macro. most of the inclusions
removed here are places where weak_alias was needed. a few were
recently introduced for hidden. some go all the way back to when
libc.h defined CANCELPT_BEGIN and _END, and all (wrongly implemented)
cancellation points had to include it.

remaining spurious users are mostly callers of the LOCK/UNLOCK macros
and files that use the LFS64 macro to define the awful *64 aliases.

in a few places, new inclusion of libc.h is added because several
internal headers no longer implicitly include libc.h.

declarations for __lockfile and __unlockfile are moved from libc.h to
stdio_impl.h so that the latter does not need libc.h. putting them in
libc.h made no sense at all, since the macros in stdio_impl.h are
needed to use them correctly anyway.
</content>
</entry>
<entry>
<title>update case mappings to unicode 10.0</title>
<updated>2017-12-19T00:34:21Z</updated>
<author>
<name>Rich Felker</name>
<email>dalias@aerifal.cx</email>
</author>
<published>2017-12-19T00:33:56Z</published>
<link rel='alternate' type='text/html' href='http://euandre.org/git/grovel/commit/?id=54941eddfd9cf2b40e489258e2fbf4bd1c90311e'/>
<id>urn:sha1:54941eddfd9cf2b40e489258e2fbf4bd1c90311e</id>
<content type='text'>
the mapping tables and code are not automatically generated; they were
produced by comparing the output of towupper/towlower against the
mappings in the UCD, ignoring characters that were previously excluded
from case mappings or from alphabetic status (micro sign and circled
letters), and adding table entries or code for everything else
missing.

based very loosely on a patch by Reini Urban.
</content>
</entry>
<entry>
<title>update ctype tables to unicode 10.0</title>
<updated>2017-12-18T23:05:23Z</updated>
<author>
<name>Rich Felker</name>
<email>dalias@aerifal.cx</email>
</author>
<published>2017-12-18T23:05:23Z</published>
<link rel='alternate' type='text/html' href='http://euandre.org/git/grovel/commit/?id=c72c1c52bc08aa0c41654bd0a38f6c951634e088'/>
<id>urn:sha1:c72c1c52bc08aa0c41654bd0a38f6c951634e088</id>
<content type='text'>
</content>
</entry>
</feed>
