Passing UTF-8 encoded data to JSON parser

jo.moeller · June 23, 2023, 11:59am

Hello,

I am trying to make a release build of SpiderMonkey with SanitizerCoverage’s callbacks enabled in order to exercise the JSON parser. Currently, I use the following code to call the parser:

static JSClassOps global_ops = {
    nullptr, nullptr, nullptr, nullptr, nullptr,
    nullptr, nullptr, nullptr, nullptr, JS_GlobalObjectTraceHook};

/* The class of the global object. */
static JSClass global_class = {"global", JSCLASS_GLOBAL_FLAGS, &global_ops};

static JSContext* cx = nullptr;
static JSObject* global = nullptr;

extern "C" int foo(const char* buf, size_t size)
{
    if (!JS_Init()) {
        return 1;
    }

    cx = JS_NewContext(JS::DefaultHeapMaxBytes);

    if (!cx) {
        assert(false);
        return 1;
    }

    if (!JS::InitSelfHostedCode(cx)) {
        assert(false);
        return 1;
    }

    JS::RealmOptions options;
    global = JS_NewGlobalObject(cx, &global_class, nullptr,
                                JS::FireOnNewGlobalHook, options);

    assert(cx != nullptr && global != nullptr);

    {
        JSAutoRealm realm(cx, global);

        std::vector<char> buf_null_terminated(buf, buf+size);
        buf_null_terminated.push_back('\0');
        JS::ConstUTF8CharsZ utf8chars(&buf_null_terminated[0], buf_null_terminated.size() - 1);

        JS::RootedString str(cx, JS_NewStringCopyUTF8Z(cx, utf8chars));
        JS::RootedValue parsed(cx);
        if (!JS_ParseJSON(cx, str, &parsed)) {
            return 0;
        }
        /* ... */
    }
}

I want the JSON parser to treat the incoming char array as UTF-8 encoded data. I have tried several variations using JS::ConstUTF8CharsZ, UTF8CharsZ and UTF8Chars. ConstUTF8CharsZ seems to be the right choice for my use case because I want to keep ownership of the char array, cf. “This class does not manage ownership of the data” (a ConstUTF8CharsN version would probably be even better, as my buffer is not null-terminated). However, I ran into a problem with this setup: the constructor of ConstUTF8CharsZ calls validate(), which crashes the program if the provided data is not valid UTF-8.
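A minimal sketch of one way to keep malformed input away from that constructor: check the bytes first and skip anything that is not well-formed UTF-8. The name isValidUtf8 is a hypothetical helper written against RFC 3629; it is not a SpiderMonkey API, and SpiderMonkey's own validate() may differ in details.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of SpiderMonkey): returns true if the bytes
// in [buf, buf + size) form well-formed UTF-8 as described in RFC 3629.
bool isValidUtf8(const char* buf, std::size_t size) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(buf);
    std::size_t i = 0;
    while (i < size) {
        unsigned char lead = p[i];
        if (lead < 0x80) {  // single-byte (ASCII) code point
            ++i;
            continue;
        }
        std::size_t len;        // total bytes in this sequence
        std::uint32_t cp;       // code point accumulated so far
        std::uint32_t minimum;  // smallest code point allowed for this length
        if ((lead & 0xE0) == 0xC0) {
            len = 2; cp = lead & 0x1F; minimum = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            len = 3; cp = lead & 0x0F; minimum = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            len = 4; cp = lead & 0x07; minimum = 0x10000;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (size - i < len) {
            return false;  // sequence is truncated
        }
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) {
                return false;  // not a continuation byte
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < minimum) return false;                 // overlong encoding
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

In the harness above, this could be called at the top of foo (for example, returning 0 when it reports false) so that only well-formed input reaches ConstUTF8CharsZ. Note that the check accepts embedded '\0' bytes, which a null-terminated view would treat as the end of the string.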

My first question is: Is there a different way to provide UTF-8 encoded data to the JSON parser?

If not: I tried compiling a release version to get rid of the validate() call. Based on several websites (1, 2, 3), I put together the following build steps:

$ git clone https://github.com/mozilla/gecko-dev
$ mkdir build
$ cd build
$ SANFLAGS="-fsanitize=address -fsanitize-coverage=inline-8bit-counters,pc-table -fno-omit-frame-pointer" CFLAGS="$SANFLAGS" CXXFLAGS="$SANFLAGS" LDFLAGS="$SANFLAGS" MOZ_DEBUG_SYMBOLS=1 MOZ_LLVM_HACKS=1 ../gecko-dev/js/src/configure --disable-jemalloc --with-system-zlib --with-intl-api --enable-optimize="-O2 -gline-tables-only" --enable-release --disable-debug --enable-address-sanitizer --disable-install-strip
$ make

I load the shared object libmozjs-114a1.so, but upon calling JS_Init I get an error from js/src/vm/Initialization.cpp:InitWithFailureDiagnostic because the release version was compiled with DEBUG=1, even though I already passed --enable-release and --disable-debug. So my second question is: How do I make DEBUG undefined when compiling SpiderMonkey?

jo.moeller · June 26, 2023, 12:18pm

I think I found the correct way to decode bytes using UTF-8:

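// Decodes a (not necessarily null-terminated) byte buffer as UTF-8 into a JSString.
JSString* useUTF8Chars(JSContext* cx, const char* buf, size_t size)
{
    // Local copy of the input bytes; UTF8Chars is given a pointer and a length.
    std::string str(buf, buf + size);
    JS::UTF8Chars utf8chars(str.c_str(), str.size());
    return JS_NewStringCopyUTF8N(cx, utf8chars);
}

JSString* s = useUTF8Chars(cx, buf, size);
if (s == nullptr) {
    exit(1);
}

JS::RootedString str(cx, s);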

I hope this is correct, as the documentation of ConstUTF8CharsZ seems to suggest that UTF8Chars takes ownership of the passed buffer (which is hopefully not the case here).
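To illustrate the ownership question, here is a small standalone sketch in plain C++ that does not use SpiderMonkey. It assumes that UTF8Chars behaves like a non-owning pointer-plus-length range and that JS_NewStringCopyUTF8N copies the bytes, as its name suggests; neither assumption is confirmed here. ByteRange and copyFromRange are hypothetical stand-ins for those two pieces.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a non-owning range: it only records where the
// caller's bytes live and never frees or copies them.
struct ByteRange {
    const char* data;
    std::size_t size;
};

// Hypothetical stand-in for a "copy"-style constructor: the returned
// std::string owns its own storage, independent of the viewed buffer.
std::string copyFromRange(ByteRange range) {
    return std::string(range.data, range.size);
}

// Views the caller's buffer, copies it, then overwrites the caller's buffer.
// The returned copy keeps the contents it was created from.
std::string copyThenOverwrite(std::string& callerOwned) {
    ByteRange range{callerOwned.data(), callerOwned.size()};  // no ownership taken
    std::string copy = copyFromRange(range);
    callerOwned.assign(callerOwned.size(), 'x');  // caller is still free to modify
    return copy;
}
```

If those assumptions hold for the real API, the caller remains responsible for the original buffer and may reuse or free it once the string has been created.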