I have a list that contains only 4 numbers. I want to compute its Cartesian product with itself, but in my case there is no difference between (x, y) and (y, x), so I want to keep only one of each such pair to avoid redundant calculations later. I removed the diagonal items with a filter operation, but I couldn't find a way to remove one of each pair of symmetric items.

data = [1,2,3,4]

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

dataRDD = sc.parallelize(data)
newRDD = dataRDD.cartesian(dataRDD)
newRDD = newRDD.filter(lambda x: x[0] != x[1])  # remove the diagonal items (x, x)

# expected output: only one of (x, y) and (y, x) for each pair

My first idea was to convert each item to a Python set and then apply distinct, but that raised an error because sets are unhashable. So I tried converting each item to a set and then to a str, which distinct accepts in PySpark, but I didn't get the correct results.
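For reference, a minimal sketch of the "order-independent key" idea in plain Python, with no Spark required. It replaces the set with a sorted tuple, which is hashable, so deduplication does not need a string conversion. The helper name `canonical` and the variables `pairs` and `unique_pairs` are made up for illustration. It assumes the elements can be compared with `<`.

```python
from itertools import product

data = [1, 2, 3, 4]

# Plain-Python stand-in for dataRDD.cartesian(dataRDD) plus the diagonal filter.
pairs = [(a, b) for a, b in product(data, data) if a != b]


def canonical(pair):
    """Return an order-independent, hashable form of a pair (hypothetical helper)."""
    return tuple(sorted(pair))


# Building a set of canonical tuples plays the role that distinct() plays on an RDD.
unique_pairs = sorted({canonical(p) for p in pairs})
print(unique_pairs)
# → [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

On the RDD itself the same idea would presumably read `newRDD.map(lambda p: tuple(sorted(p))).distinct()`. I have not run this against a cluster, so treat it as an assumption.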

On the example above that idea might work, but on my own data it works sometimes and fails other times.
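One possible reason for the inconsistency is that the text form of a set depends on its iteration order, which is not guaranteed to be the same for two sets holding the same elements. A sketch of an alternative that avoids sets and strings entirely: keep a pair only when its first element is smaller than its second. This is shown in plain Python so it runs without Spark. The names `keep_upper`, `all_pairs` and `upper_pairs` are hypothetical. It assumes the values are mutually comparable and that the list has no duplicate values.

```python
from itertools import product

data = [1, 2, 3, 4]


def keep_upper(pair):
    """True for pairs strictly above the diagonal, i.e. first element < second."""
    return pair[0] < pair[1]


# Stand-in for dataRDD.cartesian(dataRDD): all 16 ordered pairs.
all_pairs = list(product(data, data))

# One predicate drops both the diagonal (x, x) and the mirrored pair (y, x).
upper_pairs = [p for p in all_pairs if keep_upper(p)]
print(upper_pairs)
# → [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

The same predicate can be passed to the RDD filter, e.g. `dataRDD.cartesian(dataRDD).filter(lambda x: x[0] < x[1])`, which would replace the `!=` filter and would not need a distinct step afterwards.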

Thanks in advance!

