How to select a particular row with a condition in PySpark?

DataFrames in Spark are distributed, which means you can't access rows positionally the way you would with pandas df.loc[].
To get at a specific row (or rows) we have to go through a transformation. Here is an example:

import pyspark
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Build the SQLContext the snippet assumes is already available
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])

myIndex = 1
values = (df.rdd.zipWithIndex()                     # pair each Row with its index
          .filter(lambda pair: pair[1] == myIndex)  # keep only the row at myIndex
          .map(lambda pair: pair[0])                # drop the index, keep the Row
          .collect())
print(values[0])  # Row(letter='b', name=2)
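
The zipWithIndex trick above is for selecting by position. If the condition is on a column value instead, the DataFrame's own filter method handles it without dropping to the RDD. A short sketch, reusing the same df from the example:

# Select rows where a column matches a condition -- no RDD conversion needed
rows = df.filter(df.letter == "b").collect()   # list of matching Row objects
print(rows[0])                                 # Row(letter='b', name=2)

# The same idea with a SQL-style expression string
df.filter("name > 1").show()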
