Answer:
see explanation
Explanation:
The data is collected in a comma-separated values (CSV) format that always includes the following fields:
- date: string
- time: string
- client_ip: string
- server_ip: string
- url_stem: string
- url_query: string
- client_bytes: integer
- server_bytes: integer
What should you do?
a. Load the data as lines of text into an RDD, split the text on the comma delimiter, and then load the RDD into a DataFrame.
# import the csv module and pandas
import csv
import pandas as pd

# open the csv file
with open(r"C:\Users\uname\Downloads\abc.csv") as csv_file:
    # read the file using a comma delimiter
    csv_reader = csv.reader(csv_file, delimiter=',')
    rows = list(csv_reader)

# the first row holds the column names, the remaining rows are data
df = pd.DataFrame(rows[1:], columns=rows[0])
df.head()
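The snippet above uses pandas rather than an RDD. A minimal PySpark sketch of the RDD-based approach that option a describes, assuming the same illustrative file path and a local SparkSession (created below if one is not already active), might look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("weblog-load").getOrCreate()

# load the raw log file as lines of text into an RDD (path as in the snippet above)
lines = spark.sparkContext.textFile("C:/Users/uname/Downloads/abc.csv")

# drop the header line, then split every remaining line on the comma delimiter
header = lines.first()
fields = lines.filter(lambda line: line != header) \
              .map(lambda line: tuple(line.split(",")))

# turn the RDD of tuples into a DataFrame with the known column names;
# note that every column ends up as a string with this approach
columns = ["date", "time", "client_ip", "server_ip",
           "url_stem", "url_query", "client_bytes", "server_bytes"]
logs_df = spark.createDataFrame(fields, columns)
logs_df.show(5)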
b. Define a schema for the data, then read the data from the CSV file into a DataFrame using the schema.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql import SparkSession

# explicit schema matching the fields listed above
# (date and time are declared as strings, as in the field list)
newSchema = StructType([
    StructField("date", StringType(), True),
    StructField("time", StringType(), True),
    StructField("client_ip", StringType(), True),
    StructField("server_ip", StringType(), True),
    StructField("url_stem", StringType(), True),
    StructField("url_query", StringType(), True),
    StructField("client_bytes", IntegerType(), True),
    StructField("server_bytes", IntegerType(), True)
])
c. Read the data from the CSV file into a DataFrame, inferring the schema.
# infer the schema from the file instead of supplying one (path is illustrative)
abc_DF = spark.read.load(r'C:\Users\uname\Downloads\new_abc.csv', format="csv",
                         header="true", sep=',', inferSchema="true")
d. Convert the data to tab-delimited format, then read the data from the text file into a DataFrame, inferring the schema.
import pandas as pd

# read the tab-delimited file into a DataFrame (file name is illustrative)
df2 = pd.read_csv('new_abc.csv', delimiter="\t")
print('Contents of DataFrame:')
print(df2)
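The conversion step that option d describes is not shown above. A minimal pandas sketch, assuming illustrative file names:

import pandas as pd

# convert the original comma-delimited file to tab-delimited (file names are illustrative)
pd.read_csv('abc.csv').to_csv('new_abc.csv', sep='\t', index=False)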