Skip to content

feature: add selection, filtering, primitive comparison, formatting and limited casting#257

Open
mobiusklein wants to merge 2 commits intoapache:mainfrom
mobiusklein:feature/add-operations
Open

feature: add selection, filtering, primitive comparison, formatting and limited casting#257
mobiusklein wants to merge 2 commits intoapache:mainfrom
mobiusklein:feature/add-operations

Conversation

@mobiusklein
Copy link
Contributor

What's Changed

I've added an Apache.Arrow.Operations module as recommended in #254 to hold some utilities for generically working with primitive arrays.

Included:

  • Array subset selection by index or boolean mask
  • Primitive array and string array equality and comparison
  • Limited numerical aggregations
  • A recursive, not quite complete pretty-printing routine I wrote for debugging.
  • A small suite of tests to exercise the fiddly bits of indices

Closes #254

I would like some early feedback to see if I am going in the right direction. As .NET isn't a language I work in routinely, I am likely missing common names or idioms for these operations. I also do not know the degree to which the JIT is doing loop unrolling to make these homogenous operations more performant, though if it needs to be hand rolled that is a strictly later problem.

Copy link
Contributor

@CurtHagenlocher CurtHagenlocher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this work! I've left initial comments on the first two files and will try to return to it within the next day.

</PropertyGroup>

<PropertyGroup>
<TargetFrameworks>net8.0</TargetFrameworks>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we target netstandard2.0 also or do you expect it to be hard to support?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, we obviously wouldn't be able to use the INumber-based methods and would have to replace them with individual implementations by type. That could perhaps be done another time.

<PropertyGroup>
<TargetFrameworks>net8.0</TargetFrameworks>
<ImplicitUsings>enable</ImplicitUsings>
<Nullable>enable</Nullable>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other projects don't enable Nullable yet but there's no good reason not to enable it here. They also don't enable ImplicitUsings but I don't have the experience to know why we would want or not want it.

{
var builder = new BooleanArray.Builder();
builder.Reserve(mask.Length);
foreach (var val in mask)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be done much more efficiently by getting the underlying ArrowBuffers from the source array and operating on them directly. If the validity buffer is set, it can be copied as-is (we don't have a way to share buffers, unfortunately). For the value buffer, we could allocate it with ArrowBuffer.Builder and then get its Span property. For both the ReadOnlySpan on the source buffer and the Span on the target buffer, we could then cast them to (ReadOnly)Span<ulong> and invert 64 bits at a time.

To some degree this depends on how interested you are in going down the optimization rabbit hole. You could also use Benchmark.NET and try it both ways to see what kind of difference it makes.

The extreme version of this kind of thing involves learning about some of the SIMD intrinsics that were added to .NET, like Avx2. I started down this path in 2023 but never ended up having the time for it ... .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same approach could also be used for the binary operations, except that the validity buffers would need to be ANDed.

In general, I think it's best to implement these primitive operations in terms of ArrowBuffer and then have the Array-based methods call into those.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I probably mean Vector256 and not Avx2, at least initially.

/// <exception cref="InvalidOperationException"></exception>
public static BooleanArray And(BooleanArray lhs, BooleanArray rhs, MemoryAllocator? allocator = null)
{
if (lhs.Length != rhs.Length) throw new InvalidOperationException("Arrays must have the same length");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe ArgumentException is more appropriate for this kind of error? (I'm not sure.)

public static BooleanArray Invert(BooleanArray mask, MemoryAllocator? allocator = null)
{
var builder = new BooleanArray.Builder();
builder.Reserve(mask.Length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to myself that I really need to add the initial size to the builder ctor. Or you could do it ;).

/// <param name="rhs"></param>
/// <param name="allocator"></param>
/// <returns></returns>
public static BooleanArray Equal<T>(PrimitiveArray<T> lhs, T rhs, MemoryAllocator? allocator = null) where T : struct, INumber<T>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to take null values into account.

/// <exception cref="InvalidOperationException"></exception>
public static BooleanArray Equal<T>(PrimitiveArray<T> lhs, PrimitiveArray<T> rhs, MemoryAllocator? allocator = null) where T : struct, INumber<T>
{
var cmp = new BooleanArray.Builder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: move the allocation until after the length check. Also applies to a few other methods.

/// <param name="rhs"></param>
/// <param name="allocator"></param>
/// <returns></returns>
public static BooleanArray Equal(StringArray lhs, string rhs, MemoryAllocator? allocator = null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should rhs be nullable to allow a comparison to null?

It's worth calling out in the documentation that this implements C# semantics, so null == null. These are different from SQL semantics where the value of null == X is null. I think that's the right thing to do, but it might be surprising to someone expecting more SQL-like behavior.

}
}

public static BooleanArray GreaterThan<T>(PrimitiveArray<T> lhs, T rhs, MemoryAllocator? allocator = null) where T : struct, INumber<T>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The INumber-based methods don't seem to take null values into account either.

@mobiusklein
Copy link
Contributor Author

Thank you for the initial feedback. I'll try to get back to this by next weekend. I'll see if I can unwrap the bitwise operations you mentioned, but I'm inclined to maintain the AVX/SIMD stuff as a strictly later problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

A few starter generic compute kernels

2 participants