This work studies the problem of efficiently mapping GPU threads
onto simplex domains. A non-linear map lambda(w) is formulated based on a
block-space enumeration principle that reduces the number of thread-blocks by a
factor of approximately 2x and 6x for 2-simplex and 3-simplex domains,
respectively, when compared to the standard approach. Performance results show
that lambda(w) is competitive and is even the fastest map when run on recent GPU
architectures such as the Tesla V100, where it reaches a speedup of up to 1.5x
in 2-simplex tests. In 3-simplex tests, it reaches a speedup of up to 2.3x for
small workloads and up to 1.25x for larger ones. These results make lambda(w)
a useful GPU optimization technique with applications to parallel problems that
define all-pairs, all-triplets, or nearest-neighbor interactions in a 2-simplex
or 3-simplex domain.
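
The approximate 2x and 6x block reductions can be checked with a short count.
The sketch below is illustrative only and rests on two assumptions: that the
"standard approach" launches a bounding-box grid of n^d thread-blocks, and that
a discrete d-simplex of side n holds C(n+d-1, d) blocks. The function names are
hypothetical, and the index map shown is the familiar closed-form triangular
enumeration, not necessarily the lambda(w) formulated in this work.

```python
import math

def bounding_box_blocks(n, dim):
    # Assumed baseline: a full n^dim grid of thread-blocks.
    return n ** dim

def simplex_blocks(n, dim):
    # Blocks in a discrete simplex of side n: n(n+1)/2 for dim=2,
    # n(n+1)(n+2)/6 for dim=3.
    return math.comb(n + dim - 1, dim)

def triangular_index_to_coords(w):
    # Map a linear block index w to (row, col) with col <= row,
    # using an exact integer square root to avoid rounding errors.
    row = (math.isqrt(8 * w + 1) - 1) // 2
    col = w - row * (row + 1) // 2
    return row, col

if __name__ == "__main__":
    n = 1024
    for dim in (2, 3):
        ratio = bounding_box_blocks(n, dim) / simplex_blocks(n, dim)
        print(f"{dim}-simplex: {ratio:.3f}x fewer blocks")
```

For large n the ratios approach 2 and 6, which matches the factors quoted
above. The integer square root keeps the 2-simplex index map exact, whereas a
floating-point square root can misplace blocks at large indices.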