feature: add selection, filtering, primitive comparison, formatting and limited casting#257
feature: add selection, filtering, primitive comparison, formatting and limited casting#257mobiusklein wants to merge 2 commits intoapache:mainfrom
Conversation
CurtHagenlocher
left a comment
There was a problem hiding this comment.
Thanks so much for this work! I've left initial comments on the first two files and will try to return to it within the next day.
| </PropertyGroup> | ||
|
|
||
| <PropertyGroup> | ||
| <TargetFrameworks>net8.0</TargetFrameworks> |
There was a problem hiding this comment.
Can we target netstandard2.0 also or do you expect it to be hard to support?
There was a problem hiding this comment.
Okay, we obviously wouldn't be able to use the INumber-based methods and would have to replace them with individual implementations by type. That could perhaps be done another time.
| <PropertyGroup> | ||
| <TargetFrameworks>net8.0</TargetFrameworks> | ||
| <ImplicitUsings>enable</ImplicitUsings> | ||
| <Nullable>enable</Nullable> |
There was a problem hiding this comment.
The other projects don't enable Nullable yet but there's no good reason not to enable it here. They also don't enable ImplicitUsings but I don't have the experience to know why we would want or not want it.
| { | ||
| var builder = new BooleanArray.Builder(); | ||
| builder.Reserve(mask.Length); | ||
| foreach (var val in mask) |
There was a problem hiding this comment.
This can be done much more efficiently by getting the underlying ArrowBuffers from the source array and operating on them directly. If the validity buffer is set, it can be copied as-is (we don't have a way to share buffers, unfortunately). For the value buffer, we could allocate it with ArrowBuffer.Builder and then get its Span property. For both the ReadOnlySpan on the source buffer and the Span on the target buffer, we could then cast them to (ReadOnly)Span<ulong> and invert 64 bits at a time.
To some degree this depends on how interested you are in going down the optimization rabbit hole. You could also use Benchmark.NET and try it both ways to see what kind of difference it makes.
The extreme version of this kind of thing involves learning about some of the SIMD intrinsics that were added to .NET, like Avx2. I started down this path in 2023 but never ended up having the time for it ... .
There was a problem hiding this comment.
The same approach could also be used for the binary operations, except that the validity buffers would need to be ANDed.
In general, I think it's best to implement these primitive operations in terms of ArrowBuffer and then have the Array-based methods call into those.
There was a problem hiding this comment.
I think I probably mean Vector256 and not Avx2, at least initially.
| /// <exception cref="InvalidOperationException"></exception> | ||
| public static BooleanArray And(BooleanArray lhs, BooleanArray rhs, MemoryAllocator? allocator = null) | ||
| { | ||
| if (lhs.Length != rhs.Length) throw new InvalidOperationException("Arrays must have the same length"); |
There was a problem hiding this comment.
Maybe ArgumentException is more appropriate for this kind of error? (I'm not sure.)
| public static BooleanArray Invert(BooleanArray mask, MemoryAllocator? allocator = null) | ||
| { | ||
| var builder = new BooleanArray.Builder(); | ||
| builder.Reserve(mask.Length); |
There was a problem hiding this comment.
Note to myself that I really need to add the initial size to the builder ctor. Or you could do it ;).
| /// <param name="rhs"></param> | ||
| /// <param name="allocator"></param> | ||
| /// <returns></returns> | ||
| public static BooleanArray Equal<T>(PrimitiveArray<T> lhs, T rhs, MemoryAllocator? allocator = null) where T : struct, INumber<T> |
There was a problem hiding this comment.
This doesn't seem to take null values into account.
| /// <exception cref="InvalidOperationException"></exception> | ||
| public static BooleanArray Equal<T>(PrimitiveArray<T> lhs, PrimitiveArray<T> rhs, MemoryAllocator? allocator = null) where T : struct, INumber<T> | ||
| { | ||
| var cmp = new BooleanArray.Builder(); |
There was a problem hiding this comment.
Nit: move the allocation until after the length check. Also applies to a few other methods.
| /// <param name="rhs"></param> | ||
| /// <param name="allocator"></param> | ||
| /// <returns></returns> | ||
| public static BooleanArray Equal(StringArray lhs, string rhs, MemoryAllocator? allocator = null) |
There was a problem hiding this comment.
Should rhs be nullable to allow a comparison to null?
It's worth calling out in the documentation that this implements C# semantics, so null == null. These are different from SQL semantics where the value of null == X is null. I think that's the right thing to do, but it might be surprising to someone expecting more SQL-like behavior.
| } | ||
| } | ||
|
|
||
| public static BooleanArray GreaterThan<T>(PrimitiveArray<T> lhs, T rhs, MemoryAllocator? allocator = null) where T : struct, INumber<T> |
There was a problem hiding this comment.
The INumber-based methods don't seem to take null values into account either.
|
Thank you for the initial feedback. I'll try to get back to this by next weekend. I'll see if I can unwrap the bitwise operations you mentioned, but I'm inclined to maintain the AVX/SIMD stuff as a strictly later problem. |
What's Changed
I've added an
Apache.Arrow.Operationsmodule as recommended in #254 to hold some utilities for generically working with primitive arrays.Included:
Closes #254
I would like some early feedback to see if I am going in the right direction. As .NET isn't a language I work in routinely, I am likely missing common names or idioms for these operations. I also do not know the degree to which the JIT is doing loop unrolling to make these homogenous operations more performant, though if it needs to be hand rolled that is a strictly later problem.