I have a list that contains only 4 numbers. I want to compute its Cartesian product with itself, but because (x,y) and (y,x) are equivalent in my case, I want to keep only one of each such pair to avoid redundant calculations later. I removed the diagonal items with a filter operation, but I couldn't work out how to remove one of each symmetric pair.

data = [1,2,3,4]

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

dataRDD = sc.parallelize(data)
newRDD = dataRDD.cartesian(dataRDD)
newRDD = newRDD.filter(lambda x : x[0]!=x[1]) # removing diagonal items
newRDD.collect()

#expected output:
[(1,2),(1,3),(1,4),(2,3),(2,4),(3,4)]

I thought that converting each item to a Python set and then calling distinct might solve the problem, but I got an error because sets are unhashable. So I tried converting each item to a set and then to a str, which distinct accepts in PySpark, but I didn't get the correct results :/

In the example above that idea might work, but on my own data it sometimes works and sometimes doesn't.
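
For reference, here is a minimal sketch of what I am trying to get, written in plain Python so it runs without Spark. One likely reason the set-to-str idea is unreliable is that the string form of a set depends on its iteration order, which Python does not guarantee. The sketch assumes the elements can be compared with `<`; `is_upper_triangle` and `canonical` are hypothetical helper names, and `itertools.product` only stands in for `dataRDD.cartesian(dataRDD)`.

```python
from itertools import product

data = [1, 2, 3, 4]


def is_upper_triangle(pair):
    """Keep a pair only when its first element is strictly smaller.

    This drops the diagonal (x, x) and one of each (x, y) / (y, x) pair.
    Assumption: the elements are mutually comparable with `<`.
    """
    return pair[0] < pair[1]


def canonical(pair):
    """Return an order-independent, hashable key for a pair.

    Unlike str(set(pair)), a sorted tuple is deterministic.
    """
    return tuple(sorted(pair))


# itertools.product stands in for dataRDD.cartesian(dataRDD) here.
all_pairs = list(product(data, data))

unique_pairs = [p for p in all_pairs if is_upper_triangle(p)]
print(unique_pairs)
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```

On the RDD itself, the same predicate could presumably be passed as `newRDD.filter(is_upper_triangle)` in place of the `!=` filter. If the list may contain repeated values, mapping with `canonical` and then calling `distinct()` might be the safer variant, since tuples are hashable.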

Thanks in advance!
