path: root/src/multibyte
Commit message | Author | Age | Files | Lines
* Setup stub unit test infrastructure | EuAndreh | 2024-01-05 | 20 | -0/+160
* mbrtowc: Fix wrong return value when n > UINT_MAX | Alexey Izbyshev | 2023-05-26 | 1 | -1/+1
  mbrtowc truncates n to unsigned int when storing its copy. If n > UINT_MAX and the locale is not POSIX, the function will return a wrong value greater than UINT_MAX on the success path.
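The effect of such a truncation can be illustrated with a minimal sketch. It assumes the success-path return value is computed as "saved copy of n minus bytes still remaining"; that bookkeeping and both function names are hypothetical and are not taken from the musl source.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical model: the length is saved in an unsigned int, and the
   return value is "saved length minus bytes remaining". */
static size_t consumed_with_truncated_copy(size_t n, size_t used)
{
    const unsigned saved = (unsigned)n;  /* drops the high bits when n > UINT_MAX */
    assert(used <= n);
    return saved - (n - used);           /* computed in size_t, so it can wrap */
}

/* Same bookkeeping with a full-width copy of the length. */
static size_t consumed_with_full_copy(size_t n, size_t used)
{
    const size_t saved = n;
    assert(used <= n);
    return saved - (n - used);           /* always equals used */
}
```

Where size_t is wider than unsigned int, the first variant returns a huge wrapped value once n exceeds UINT_MAX, while the second always returns the number of bytes consumed.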
* rewrite wcsnrtombs to fix buffer overflow and other bugs | Rich Felker | 2020-11-19 | 1 | -27/+19
  the original wcsnrtombs implementation, which has been largely untouched since 0.5.0, attempted to build input-length-limiting conversion on top of wcsrtombs, which only limits output length. as best I recall, this choice was made out of a mix of disdain over having yet another variant function to implement (added in POSIX 2008; not standard C) and preference not to switch things around and implement the wcsrtombs in terms of the more general new function, probably over namespace issues. the strategy employed was to impose output limits that would ensure the input limit wasn't exceeded, then finish up the tail character-at-a-time. unfortunately, none of that worked correctly.

  first, the logic in the wcsrtombs loop was wrong in that it could easily get stuck making no forward progress, by imposing an output limit too small to convert even one character.

  the character-at-a-time loop that followed was even worse. it made no effort to ensure that the converted multibyte character would fit in the remaining output space, only that there was a nonzero amount of output space remaining. it also employed an incorrect interpretation of wcrtomb's interface contract for converting the null character, thereby failing to act on end of input, and remaining space accounting was subject to unsigned wrap-around. together these errors allow unbounded overflow of the destination buffer, controlled by input length limit and input wchar_t string contents.

  given the extent to which this function was broken, it's plausible that most applications that would have been rendered exploitable were sufficiently broken not to be usable in the first place. however, it's also plausible that common (especially ASCII-only) inputs succeeded in the wcsrtombs loop, which mostly worked, while leaving the wildly erroneous code in the second loop exposed to particular non-ASCII inputs.

  CVE-2020-28928 has been assigned for this issue.
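The failure modes listed above (no fit check, mishandled null character, wrapping space accounting) suggest what a character-at-a-time loop has to do. The following is a minimal sketch under stated assumptions, not the rewritten musl function: sketch_wcsnrtombs is a hypothetical name, dst must be non-null, and a fresh conversion state is used on every call.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert at most wn wide characters from *src
   into at most n bytes at dst. Returns bytes written (terminator not
   counted) or (size_t)-1 on an unconvertible character. */
static size_t sketch_wcsnrtombs(char *dst, const wchar_t **src,
                                size_t wn, size_t n)
{
    const wchar_t *ws;
    size_t total = 0;
    mbstate_t st;

    assert(dst && src);
    ws = *src;
    memset(&st, 0, sizeof st);            /* initial conversion state */

    while (ws && wn) {
        char tmp[MB_LEN_MAX];
        /* Convert into scratch space first so the length is known
           before anything touches the destination buffer. */
        size_t len = wcrtomb(tmp, *ws, &st);
        if (len == (size_t)-1) { total = (size_t)-1; break; }
        if (len > n) break;               /* would not fit: stop, no wrap */
        memcpy(dst, tmp, len);
        dst += len;
        n -= len;
        if (*ws == L'\0') { ws = NULL; break; }  /* end of input reached */
        ws++;
        wn--;
        total += len;
    }
    *src = ws;                            /* NULL only when the terminator was converted */
    return total;
}
```

Converting into a scratch buffer costs an extra copy per character but makes the fit check trivial, which is the property the broken tail loop lacked.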
* fix aliasing-based undefined behavior in mbsrtowcs | Rich Felker | 2019-10-13 | 1 | -2/+8
  mbsrtowcs contains "vectorized" loops to quickly step over bytes without the high bit set; these have undefined behavior by virtue of aliasing uint32_t over top of char data for the accesses. commit 4d0a82170a25464c39522d7190b9fe302045ddb2 fixed the corresponding usage in string functions by using the may_alias attribute conditional on __GNUC__ and disabled the vectorized code in its absence. do the same for mbsrtowcs.
* reduce spurious inclusion of libc.h | Rich Felker | 2018-09-12 | 1 | -1/+1
  libc.h was intended to be a header for access to global libc state and related interfaces, but ended up included all over the place because it was the way to get the weak_alias macro. most of the inclusions removed here are places where weak_alias was needed. a few were recently introduced for hidden. some go all the way back to when libc.h defined CANCELPT_BEGIN and _END, and all (wrongly implemented) cancellation points had to include it. remaining spurious users are mostly callers of the LOCK/UNLOCK macros and files that use the LFS64 macro to define the awful *64 aliases. in a few places, new inclusion of libc.h is added because several internal headers no longer implicitly include libc.h. declarations for __lockfile and __unlockfile are moved from libc.h to stdio_impl.h so that the latter does not need libc.h. putting them in libc.h made no sense at all, since the macros in stdio_impl.h are needed to use them correctly anyway.
* define and use internal macros for hidden visibility, weak refs | Rich Felker | 2018-09-05 | 1 | -4/+2
  this cleans up what had become widespread inline use of "GNU C" style attributes directly in the source, and lowers the barrier to increased use of hidden visibility, which will be useful to recovering some of the efficiency lost when the protected visibility hack was dropped in commit dc2f368e565c37728b0d620380b849c3a1ddd78f, especially on archs where the PLT ABI is costly.
* fix erroneous acceptance of f4 9x xx xx code sequences by utf-8 decoder | Rich Felker | 2017-09-01 | 1 | -1/+1
  the DFA table controlling accepted ranges for the f4 prefix used an incorrect upper bound of 0xa0 where it should have been 0x90, allowing such sequences to be accepted and decoded as non-Unicode-scalar values 0x110000 through 0x11ffff.
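The bound in question can be checked by hand with a small sketch. It does not reproduce the DFA table; f4_second_byte_ok and decode4 are hypothetical helpers that spell out the range test and the plain four-byte bit arithmetic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: valid second bytes after the lead byte 0xF4
   are 0x80 through 0x8F, so the exclusive upper bound is 0x90. */
static bool f4_second_byte_ok(unsigned char b2)
{
    return b2 >= 0x80 && b2 < 0x90;
}

/* Plain four-byte UTF-8 bit arithmetic with no range validation,
   used here only to show which values each sequence would yield. */
static unsigned long decode4(const unsigned char s[4])
{
    assert((s[0] & 0xf8) == 0xf0);        /* four-byte lead byte expected */
    return ((unsigned long)(s[0] & 0x07) << 18)
         | ((unsigned long)(s[1] & 0x3f) << 12)
         | ((unsigned long)(s[2] & 0x3f) << 6)
         |  (unsigned long)(s[3] & 0x3f);
}
```

With these definitions, f4 8f bf bf yields 0x10ffff, the last Unicode scalar value, while f4 90 80 80 would yield 0x110000, which is why the second byte has to be rejected from 0x90 upward.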
* fix erroneous stop before input limit in mbsnrtowcs and wcsnrtombs | Rich Felker | 2017-08-31 | 2 | -2/+6
  the value computed as an output limit that bounds the amount of input consumed below the input limit was incorrectly being used as the actual amount of input consumed. instead, compute the actual amount of input consumed as a difference of pointers before and after the conversion. patch by Mikhail Kremnyov.
* remove comments on copyright status from UTF-8 implementation files | Rich Felker | 2016-06-21 | 13 | -78/+0
  despite clarifications made to the COPYRIGHT file in commit f0a61399330bae42beeb27d6ecd05570b3382a60, there continues to be confusion about whether the permissions granted actually apply to all files. I am the sole author of these files and clearly intend, and have always intended, for the grant of permission to apply to them.
* explicitly include stdio.h to get EOF definition needed by wctob | Michael Meeuwisse | 2016-03-02 | 1 | -0/+1
* fix undefined left-shift of negative values in utf-8 state table | Rich Felker | 2015-07-25 | 1 | -1/+1
* byte-based C locale, phase 1: multibyte character handling functions | Rich Felker | 2015-06-16 | 7 | -3/+46
  this patch makes the functions which work directly on multibyte characters treat the high bytes as individual abstract code units rather than as multibyte sequences when MB_CUR_MAX is 1. since MB_CUR_MAX is presently defined as a constant 4, all of the new code added is dead code, and optimizing compilers' code generation should not be affected at all. a future commit will activate the new code.

  as abstract code units, bytes 0x80 to 0xff are represented by wchar_t values 0xdf80 to 0xdfff, at the end of the surrogates range. this ensures that they will never be misinterpreted as Unicode characters, and that all wctype functions return false for these "characters" without needing locale-specific logic. a high range outside of Unicode such as 0x7fffff80 to 0x7fffffff was also considered, but since C11's char16_t also needs to be able to represent conversions of these bytes, the surrogate range was the natural choice.
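The byte-to-code-unit mapping described above can be written down as a short sketch. The three helper names are hypothetical, and the expressions are one straightforward way to realize the stated ranges rather than the macros used in the source.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical helper: map a high byte 0x80..0xff to 0xdf80..0xdfff. */
static wchar_t byte_to_codeunit(unsigned char b)
{
    assert(b >= 0x80);                    /* ASCII bytes map to themselves elsewhere */
    return (wchar_t)(0xdf00 | b);
}

/* Hypothetical helper: does wc lie in the abstract code unit range? */
static int is_codeunit(wchar_t wc)
{
    return wc >= 0xdf80 && wc <= 0xdfff;
}

/* Hypothetical helper: recover the original byte from a code unit. */
static unsigned char codeunit_to_byte(wchar_t wc)
{
    assert(is_codeunit(wc));
    return (unsigned char)(wc & 0xff);
}
```

Because the range sits inside the low surrogates, every value also fits in a 16-bit code unit, which matches the char16_t argument made in the commit message.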
* fix btowc corner case | Rich Felker | 2015-06-16 | 1 | -0/+1
  btowc is required to interpret its argument by conversion to unsigned char, unless the argument is equal to EOF. since the conversion produces a non-character value anyway, we can just unconditionally convert, for now.
* remove libc.h dependency from otherwise-independent multibyte code | Rich Felker | 2015-04-22 | 1 | -2/+4
* remove cruft for libc struct accessor function and broken visibility | Rich Felker | 2015-04-22 | 1 | -4/+0
  these were hacks to work around toolchains that could not properly optimize PIC accesses based on visibility and would generate GOT lookups even for hidden data, which broke the old dynamic linker. since commit f3ddd173806fd5c60b3f034528ca24542aecc5b9 it no longer matters; the dynamic linker does not assume accessibility of this data until stage 3.
* fix return value computation in one code path of wcsnrtombs | Rich Felker | 2014-12-18 | 1 | -1/+1
  the affected code was wrongly counting characters instead of bytes.
* implement a private state for the uchar.h functions | Jens Gustedt | 2014-11-15 | 3 | -0/+6
  The C standard is imperative on that:
  7.28.1 ... If ps is a null pointer, each function uses its own internal mbstate_t object instead, which is initialized at program startup to the initial conversion state;
  and these functions are also not supposed to implicitly use the state of the wchar.h functions:
  7.29.6.3 ... The implementation behaves as if no library function calls these functions with a null pointer for ps.
  Previously this resulted in two bugs.
  - The functions c16rtomb and mbrtoc16 would crash when called with ps set to null.
  - The function mbrtoc32 used the private state of mbrtowc, which it is not allowed to do.
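The pattern the standard asks for can be sketched as follows. This is an illustration built on mbrtowc, not the uchar.h code itself; decode_with_private_state is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical wrapper: forwards to mbrtowc, but substitutes its own
   state object for a null ps so that mbrtowc never sees a null
   pointer and its internal state is never shared. */
static size_t decode_with_private_state(wchar_t *wc, const char *s,
                                        size_t n, mbstate_t *ps)
{
    /* Static storage is zero-initialized at program startup, which
       describes the initial conversion state. */
    static mbstate_t internal_state;

    if (!ps) ps = &internal_state;
    assert(ps != NULL);                   /* invariant: a real object from here on */
    return mbrtowc(wc, s, n, ps);
}
```

Each function that accepts a null ps would need its own such object to satisfy the "its own internal mbstate_t" wording quoted above.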
* implement uchar.h (C11 UTF-16/32 conversion) interfaces | Rich Felker | 2014-10-13 | 4 | -0/+79
* fix aliasing violations in mbtowc and mbrtowc | Rich Felker | 2014-07-01 | 2 | -2/+4
  these functions were setting wc to point to wchar_t aliasing itself as a "cheap" way to support null wc arguments. doing so was anything but cheap, since even without the aliasing violation, it would limit the compiler's ability to optimize. making wc point to a dummy object is equally easy and does not suffer from the above problems.
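The dummy-object approach can be sketched with a deliberately tiny decoder. ascii_to_wc is a hypothetical ASCII-only helper made up for illustration; only the handling of a null wc mirrors what the commit message describes.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical ASCII-only decoder: returns 1 for a nonzero ASCII
   byte, 0 for the null byte, and -1 for any byte with the high bit set. */
static int ascii_to_wc(wchar_t *wc, const char *s)
{
    wchar_t dummy;                        /* real object of the right type */
    unsigned char c;

    assert(s != NULL);
    if (!wc) wc = &dummy;                 /* null wc: store into the dummy instead */
    c = (unsigned char)*s;
    if (c >= 0x80) return -1;
    *wc = c;                              /* unconditional store, no branch on wc */
    return c != 0;
}
```

The store always targets a genuine wchar_t object, so there is no type punning and the rest of the function needs no special case for a null argument.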
* fix incorrect end pointer in some cases when wcsrtombs stops early | Rich Felker | 2014-06-02 | 1 | -7/+15
  when wcsrtombs stopped due to hitting zero remaining space in the output buffer, it was wrongly clearing the position pointer as if it had completed the conversion successfully. this commit rearranges the code somewhat to make a clear separation between the cases of ending due to running out of output buffer space, and ending due to reaching the end of input or an illegal sequence in the input. the new branches have been arranged with the hope of optimizing more common cases, too.
* include cleanups: remove unused headers and add feature test macros | Szabolcs Nagy | 2013-12-12 | 13 | -51/+3
* fix buffer overflow in mbsrtowcs | Rich Felker | 2013-09-27 | 1 | -1/+1
  issue reported by Michael Forney: "If wn becomes 0 after processing a chunk of 4, mbsrtowcs currently continues on, wrapping wn around to -1, causing the rest of the string to be processed. This resulted in buffer overruns if there was only space in ws for wn wide characters."

  the original patch submitted added an additional check for !wn after the loop; to avoid extra branching, I instead just changed the wn>=4 check to wn>=5 to ensure that at least one slot remains after the word-at-a-time loop runs. this should not slow down the tail processing on real-world usage, since an extra slot that can't be processed in the word-at-a-time loop is needed for the null termination anyway.
* fix failure of mbsrtowcs to record stop position when dest is full | Rich Felker | 2013-06-29 | 1 | -1/+4
* mbrtowc: do not leave mbstate_t in permanent-fail state after EILSEQ | Rich Felker | 2013-04-08 | 1 | -1/+1
  the standard is clear that the old behavior is conforming: "In this case, [EILSEQ] shall be stored in errno and the conversion state is undefined."

  however, the specification of mbrtowc has one peculiarity when the source argument is a null pointer: in this case, it's required to behave as mbrtowc(NULL, "", 1, ps). no motivation is provided for this requirement, but the natural one that comes to mind is that the intent is to reset the mbstate_t object. for stateful encodings, such behavior is actually specified: "If the corresponding wide character is the null wide character, the resulting state described shall be the initial conversion state." but in the case of UTF-8 where the mbstate_t object contains a partially-decoded character rather than a shift state, a subsequent '\0' byte indicates that the previous partial character is incomplete and thus an illegal sequence.

  naturally, applications using their own mbstate_t object should clear it themselves after an error, but the standard presently provides no way to clear the builtin mbstate_t object used when the ps argument is a null pointer. I suspect this issue may be addressed in the future by specifying that a null source argument resets the state, as this seems to have been the intent all along.

  for what it's worth, this change also slightly reduces code size.
* implement mbtowc directly, not as a wrapper for mbrtowc | Rich Felker | 2013-04-08 | 1 | -5/+39
  the interface contract for mbtowc admits a much faster implementation than mbrtowc can achieve; wrapping mbrtowc with an extra call frame only made the situation worse. since the regex implementation uses mbtowc already, this change should improve regex performance too. it may be possible to improve performance in other places internally by switching from mbrtowc to mbtowc.
* optimize mbrtowc | Rich Felker | 2013-04-08 | 1 | -3/+2
  this simple change, in my measurements, makes about a 7% performance improvement. at first glance this change would seem like a compiler-specific hack, since the modified code is not even used. however, I suspect the reason is that I'm eliminating a second path into the main body of the code, allowing the compiler more flexibility to optimize the normal (hot) path into the main body. so even if it weren't for the measurable (and quite notable) difference in performance, I think the change makes sense.
* fix out-of-bounds access in UTF-8 decoding | Rich Felker | 2013-04-08 | 1 | -1/+1
  SA and SB are used as the lowest and highest valid starter bytes, but the value of SB was one-past the last valid starter. this caused access past the end of the state table when the illegal byte '\xf5' was encountered in a starter position. the error did not show up in full-character decoding tests, since the bogus state read from just past the table was unlikely to admit any continuation bytes as valid, but would have shown up had we tested feeding '\xf5' to the byte-at-a-time decoding in mbrtowc: it would cause the function to wrongly return -2 rather than -1.

  I may eventually go back and remove all references to SA and SB, replacing them with the values; this would make the code more transparent, I think. the original motivation for using macros was to allow misguided users of the code to redefine them for the purpose of enlarging the set of accepted sequences past the end of Unicode...
* cleanup wcstombs | Rich Felker | 2013-04-04 | 1 | -12/+1
  remove redundant headers and comments; this file is completely trivial now. also, avoid temp var.
* cleanup mbstowcs wrapper | Rich Felker | 2013-04-04 | 1 | -10/+0
  remove unneeded headers. this file is utterly trivial now and there's no sense in having a comment to state that it's in the public domain.
* minor optimization to mbstowcs | Rich Felker | 2013-04-04 | 1 | -2/+1
  there is no need to zero-fill an mbstate_t object in the caller; mbsrtowcs will automatically treat a null pointer as the initial state.
* fix incorrect range checks in wcsrtombs | Rich Felker | 2013-04-04 | 1 | -3/+3
  negative values of wchar_t need to be treated in the non-ASCII case so that they can properly generate EILSEQ rather than getting truncated to 8bit values and stored in the output.
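A range check that keeps negative values out of the ASCII path can be sketched like this. The helper names are hypothetical and the cast is one common way to do it, not necessarily the expression used in wcsrtombs.

```c
#include <assert.h>
#include <wchar.h>

/* Naive check: where wchar_t is signed, a negative value compares
   less than 0x80 and is wrongly classified as ASCII. */
static int is_ascii_naive(wchar_t wc)
{
    return wc < 0x80;
}

/* Converting to an unsigned type first turns negative values into
   very large ones, so they fall through to the non-ASCII path. */
static int is_ascii_checked(wchar_t wc)
{
    return (unsigned long)wc < 0x80;
}

/* Hypothetical fast path: only values that pass the checked test
   may be narrowed to a single output byte. */
static char ascii_byte(wchar_t wc)
{
    assert(is_ascii_checked(wc));
    return (char)wc;
}
```

Both checks agree on every non-negative input; they differ only for negative values, which is exactly the case the fix is about.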
* overhaul mbsrtowcs | Rich Felker | 2013-04-04 | 1 | -69/+64
  these changes fix at least two bugs:
  - misaligned access to the input as uint32_t for vectorized ASCII test
  - incorrect src pointer after stopping on EILSEQ

  in addition, the text of the standard makes it unclear whether the mbstate_t object is to be modified when the destination pointer is null; previously it was cleared either way; now, it's only cleared when the destination is non-null. this change may need revisiting, but it should not affect most applications, since calling mbsrtowcs with non-zero state can only happen when the head of the string was already processed with mbrtowc.

  finally, these changes shave about 20% size off the function and seem to improve performance by 1-5%.
* use restrict everywhere it's required by c99 and/or posix 2008 | Rich Felker | 2012-09-06 | 10 | -11/+12
  to deal with the fact that the public headers may be used with pre-c99 compilers, __restrict is used in place of restrict, and defined appropriately for any supported compiler. we also avoid the form [restrict] since older versions of gcc rejected it due to a bug in the original c99 standard, and instead use the form *restrict.
* fix failure of mbsinit(0) (not UB; required to return nonzero) | Rich Felker | 2012-05-26 | 1 | -1/+1
  issue reported by Richard Pennington; slightly simpler fix applied
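The requirement named in the subject line can be exercised with a few lines of standard C. check_mbsinit_contract is a hypothetical name for this self-check.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical self-check of the mbsinit contract. */
static void check_mbsinit_contract(void)
{
    mbstate_t st;

    memset(&st, 0, sizeof st);            /* a zero-valued object describes the initial state */
    assert(mbsinit(&st) != 0);
    assert(mbsinit(NULL) != 0);           /* a null pointer must also yield nonzero */
}
```

A conforming libc passes both assertions regardless of the current locale.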
* fix longstanding exit logic bugs in mbsnrtowcs and wcsnrtombs | Rich Felker | 2012-05-02 | 2 | -4/+9
  these are POSIX 2008 (previously GNU extension) functions that are rarely used. apparently they had never been tested before, since the end-of-string logic was completely missing. mbsnrtowcs is used by modern versions of bash for its glob implementation, and this bug was causing tab completion to hang in an infinite loop.
* new attempt at working around the gcc 3 visibility bug | Rich Felker | 2012-02-24 | 1 | -0/+4
  since gcc is failing to generate the necessary ".hidden" directive in the output asm, generate it explicitly with an __asm__ statement...
* remove useless attribute visibility from definitions | Rich Felker | 2012-02-24 | 1 | -1/+1
  this was a failed attempt at working around the gcc 3 visibility bug affecting x86_64. subsequent patch will address it with an ugly but working hack.
* cleanup and work around visibility bug in gcc 3 that affects x86_64 | Rich Felker | 2012-02-23 | 2 | -6/+4
  in gcc 3, the visibility attribute must be placed on both the declaration and on the definition. if it's omitted from the definition, the compiler fails to emit the ".hidden" directive in the assembly, and the linker will either generate textrels (if supported, such as on i386) or refuse to link (on targets where certain types of textrels are forbidden or impossible without further assumptions about memory layout, such as on x86_64).

  this patch also unifies the decision about when to use visibility into libc.h and makes the visibility in the utf-8 state machine tables based on libc.h rather than a duplicate test.
* fix all implicit conversion between signed/unsigned pointers | Rich Felker | 2011-03-25 | 1 | -1/+1
  sadly the C language does not specify any such implicit conversion, so this is not a matter of just fixing warnings (as gcc treats it) but actual errors. i would like to revisit a number of these changes and possibly revise the types used to reduce the number of casts required.
* cleanup utf-8 multibyte code, use visibility if possible | Rich Felker | 2011-02-27 | 3 | -84/+5
  this code was written independently of musl, with support for the backwards, nonstandard "31-bit unicode" some libraries/apps might want. unfortunately the extra code (inside #ifdef) makes the source harder to read and makes code that should be simple look complex, so i'm removing it. anyone who wants to use the old code can find it in the history or from elsewhere.

  also, change the visibility of the __fsmu8 state machine table to hidden, if supported. this should improve performance slightly in shared-library builds.
* remove sample utf-8 code that's not part of the standard library | Rich Felker | 2011-02-21 | 1 | -47/+0
* cleanup multibyte stuff to remove ugly casts, sanitize the ptr align casts | Rich Felker | 2011-02-13 | 3 | -27/+27
* initial check-in, version 0.5.0 | Rich Felker | 2011-02-12 | 18 | -0/+694