
GH-46600: [C++][CI] Add job with ARROW_LARGE_MEMORY_TESTS enabled #49490

Draft
raulcd wants to merge 10 commits into apache:main from raulcd:GH-46600

Conversation

@raulcd
Member

@raulcd raulcd commented Mar 10, 2026

TBD

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

This PR includes breaking changes to public APIs. (If there are any breaking changes to public APIs, please explain which changes are breaking. If not, you can remove this.)

This PR contains a "Critical Fix". (If the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld), please provide explanation. If not, you can remove this.)

@github-actions

⚠️ GitHub issue #46600 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added the "CI: Extra: C++" (Run extra C++ CI) and "awaiting committer review" (Awaiting committer review) labels Mar 10, 2026
@raulcd
Member Author

raulcd commented Mar 10, 2026

@rok I tried with /spot=capacity-optimized, with no /spot option at all, and also forcing /spot=false; all of them seem to fail due to quota:

Error: Failed to launch runner: failed to create fleet: [{"ErrorCode":"VcpuLimitExceeded","ErrorMessage":"You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit."

Should we request a quota increase for x8i.xlarge?

@rok
Member

rok commented Mar 10, 2026

Let me take a look.

@rok
Member

rok commented Mar 10, 2026

Requested 8 vcpus for spot and 8 for on-demand.

@raulcd
Member Author

raulcd commented Mar 10, 2026

Requested 8 vcpus for spot and 8 for on-demand.

nice! Thanks @rok

@rok
Member

rok commented Mar 10, 2026

Requests were approved. The change usually needs some 10 minutes to propagate.

@raulcd
Member Author

raulcd commented Mar 11, 2026

I tried with both 64GB and 128GB machines to validate that it wasn't a RAM issue. There are a couple of test failures due to timeout:

	 96 - parquet-arrow-reader-writer-test (Timeout)
	106 - gandiva-projector-test (Timeout)

And one due to what seems like a bug in Parquet (WriteLargeDictEncodedPage):

 [ RUN      ] TestColumnWriter.WriteLargeDictEncodedPage
/arrow/cpp/src/parquet/column_writer_test.cc:1100: Failure
Expected equality of these values:
  page_count
    Which is: 7501
  2
[  FAILED  ] TestColumnWriter.WriteLargeDictEncodedPage (19975 ms)

I'll see if I can reproduce the timeouts locally.

@raulcd
Member Author

raulcd commented Mar 11, 2026

I have 64GB of RAM locally.
TestHugeProjector.SimpleTestSumHuge from gandiva-projector-test takes more than 15 minutes locally for me with a Debug build.
With a release build it takes ~3 minutes but fails, see:

[----------] 1 test from TestHugeFilter
[ RUN      ] TestHugeFilter.TestSimpleHugeFilter
/home/raulcd/code/arrow/cpp/src/gandiva/tests/huge_table_test.cc:157: Failure
Value of: (exp)->Equals(selection_vector->ToArray(), arrow::EqualOptions().nans_equal(true))
  Actual: false
Expected: true
expected array: [
  4,
  5,
  9,
  11,
  12,
  13,
  19,
  21,
  25,
  26,
  ...
  2147483625,
  2147483627,
  2147483629,
  2147483630,
  2147483636,
  2147483637,
  2147483641,
  2147483643,
  2147483644,
  2147483645
] actual array: [
  0,
  1,
  2,
  3,
  6,
  7,
  8,
  10,
  14,
  15,
  ...
  2147483634,
  2147483635,
  2147483638,
  2147483639,
  2147483640,
  2147483642,
  2147483646,
  2147483647,
  2147483648,
  2147483649
]

[  FAILED  ] TestHugeFilter.TestSimpleHugeFilter (153849 ms)
[----------] 1 test from TestHugeFilter (153850 ms total)

For parquet-arrow-reader-writer-test the problem is TestArrowReaderAdHoc.LargeStringColumn; running locally on a release build it takes ~10 minutes (I haven't tested on debug, and I am not sure I want to :P)

[ RUN      ] TestArrowReaderAdHoc.LargeStringColumn
[       OK ] TestArrowReaderAdHoc.LargeStringColumn (602823 ms)

The parquet-writer-test TestColumnWriter.WriteLargeDictEncodedPage and TestColumnWriter.ThrowsOnDictIndicesTooLarge also fail locally for me:

[ RUN      ] TestColumnWriter.WriteLargeDictEncodedPage
/home/raulcd/code/arrow/cpp/src/parquet/column_writer_test.cc:1100: Failure
Expected equality of these values:
  page_count
    Which is: 7501
  2

[  FAILED  ] TestColumnWriter.WriteLargeDictEncodedPage (2190 ms)
[ RUN      ] TestColumnWriter.ThrowsOnDictIndicesTooLarge
/home/raulcd/code/arrow/cpp/src/parquet/column_writer_test.cc:1147: Failure
Expected: try { ([&]() { file_writer->Close(); })(); } catch (const ParquetException& err) { switch (0) case 0: default: if (const ::testing::AssertionResult gtest_ar = (::testing::internal::MakePredicateFormatterFromMatcher((::testing::Property(&ParquetException::what, ::testing::HasSubstr("exceeds maximum int value"))))("err", err))) ; else ::testing::internal::AssertHelper(::testing::TestPartResult::kNonFatalFailure, "/home/raulcd/code/arrow/cpp/src/parquet/column_writer_test.cc", 1147, gtest_ar.failure_message()) = ::testing::Message(); throw; } throws an exception of type ParquetException.
  Actual: it throws nothing.

[  FAILED  ] TestColumnWriter.ThrowsOnDictIndicesTooLarge (23736 ms)


My takeaways from this: we can enable a job that runs the large memory tests, but there currently seem to be some bugs in them, both for Gandiva and Parquet. We probably want to run on CI with a release build in order to shorten execution time, but even then we will need something like a 15-minute timeout on individual tests.
@pitrou what are your thoughts?

Should I open individual issues for those tests?
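One way to raise the per-test limit would be a CTest property on the affected suites; a minimal sketch (the test names are taken from the logs above, and the 1800-second value is an assumption, not a measured requirement):

```cmake
# Sketch: raise the per-test timeout for the large-memory suites.
# 1800 seconds is a guess based on the ~20-minute CI timeout seen above.
set_tests_properties(parquet-arrow-reader-writer-test
                     gandiva-projector-test
                     PROPERTIES TIMEOUT 1800)
```

Equivalently, `ctest --timeout 1800` would override the limit for a whole run without touching the build files.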

@rok
Member

rok commented Mar 11, 2026

+1 to increasing timeouts and including them in the extras and/or release jobs.

@raulcd
Member Author

raulcd commented Mar 11, 2026

It seems to require a really long timeout:

[  FAILED  ] 1 test, listed below:
[  FAILED  ] TestHugeFilter.TestSimpleHugeFilter
 1 FAILED TEST
/build/cpp/src/gandiva/tests

        Start  96: parquet-arrow-reader-writer-test
Running parquet-arrow-reader-writer-test, redirecting output into /build/cpp/build/test-logs/parquet-arrow-reader-writer-test.txt (attempt 1/1)
107/107 Test  #96: parquet-arrow-reader-writer-test .............***Timeout 1200.10 sec

@pitrou
Member

pitrou commented Mar 11, 2026

  1. We want to test in debug mode to keep all runtime checks, assertions, etc. activated, but we can enable some optimizations, see [C++][CI] Have a job with ARROW_LARGE_MEMORY_TESTS enabled #46600 (comment)
  2. We can just disable Gandiva, it's not really maintained anyway
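A possible configure sketch following these two points. ARROW_LARGE_MEMORY_TESTS comes from this PR's title and ARROW_GANDIVA is an existing Arrow option, but the -O1 level is an assumption; the linked comment may suggest a different setting:

```shell
# Debug build with light optimization, large-memory tests on, Gandiva off.
# -O1 is an assumption for "debug with some optimizations enabled".
cmake -S cpp -B build \
  -DCMAKE_BUILD_TYPE=Debug \
  -DCMAKE_CXX_FLAGS_DEBUG="-g -O1" \
  -DARROW_LARGE_MEMORY_TESTS=ON \
  -DARROW_GANDIVA=OFF
```

Keeping CMAKE_BUILD_TYPE=Debug preserves assertions and DCHECKs, while the optimization flag shortens the runtime of the large-memory suites.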
