AZURE DATA FACTORY INTERVIEW QUESTIONS: FILE FORMATS
21.What are the different types of file formats?
1.Delimited text files
CSV Files and TSV Files
CSV Files:-
*Comma Separated Values.
*A comma is the standard character used to separate the fields in each row; closely related delimited files use other separators such as a pipe (|).
TSV Files:-
*Tab Separated Values.
*A tab character is used to separate the fields in each row.
*Fixed-width data is a related text format in which each field is allocated a fixed number of characters instead of being separated by a delimiter.
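
The three text layouts above can be read with a few lines of Python. This is a minimal sketch using only the standard library; the sample data, the column names, and the fixed-width layout (3, 6, and 6 characters) are assumptions made up for illustration.

```python
import csv
import io

# Hypothetical sample data: the same record in three delimited layouts.
csv_text = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
pipe_text = "id|name|city\n1|Asha|Pune\n"
tsv_text = "id\tname\tcity\n1\tAsha\tPune\n"


def read_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_delimited(csv_text, ",")
pipe_rows = read_delimited(pipe_text, "|")
tsv_rows = read_delimited(tsv_text, "\t")

# Fixed-width: no delimiter, each field occupies a fixed number of characters.
# Assumed layout: id = 3 chars, name = 6 chars, city = 6 chars.
fixed_line = "1  Asha  Pune  "
widths = [("id", 3), ("name", 6), ("city", 6)]


def read_fixed_width(line, widths):
    """Slice one line into fields by position and strip the padding."""
    record, pos = {}, 0
    for field, width in widths:
        record[field] = line[pos:pos + width].strip()
        pos += width
    return record


fixed_row = read_fixed_width(fixed_line, widths)
print(csv_rows[0], pipe_rows[0], tsv_rows[0], fixed_row)
```

Only the delimiter argument changes between the CSV, pipe, and TSV cases, which is why these formats are usually grouped together.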
2.JSON
*JavaScript Object Notation (JSON).
*JSON is a standard data interchange format.
*JSON files are text-based and human-readable, and can be edited easily.
Example:-
{
  "metadata": {
    "Origin": "Application",
    "Alt": {
      "CorrelationId": "1234"
    },
    "EventTimeStamp": "67325282482",
    "Event Name": "Update"
  },
  "PayLoad": {
    "Application": {
      "Application Id": "6274764",
      "Customer Id": "65124851",
      "Product": "Finance",
      "Previous State": "Submitted"
    }
  }
}
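
A short sketch of how such a document could be read with Python's standard `json` module. The payload below is a trimmed variant of the event in this section with simplified key names, so the keys and values are illustrative assumptions.

```python
import json

# A trimmed, valid variant of the event document; keys and values are illustrative.
raw = """
{
  "metadata": {
    "Origin": "Application",
    "Alt": {"CorrelationId": "1234"},
    "EventName": "Update"
  },
  "PayLoad": {
    "Application": {"ApplicationId": "6274764", "Product": "Finance"}
  }
}
"""

event = json.loads(raw)  # text -> nested dicts
correlation_id = event["metadata"]["Alt"]["CorrelationId"]
product = event["PayLoad"]["Application"]["Product"]

# Serialising back to text and parsing again gives the same structure.
round_tripped = json.loads(json.dumps(event))

# A missing comma between two members makes the document invalid.
try:
    json.loads('{"a": "1" "b": "2"}')
    is_valid = True
except json.JSONDecodeError:
    is_valid = False

print(correlation_id, product, is_valid)
```

The last step shows why punctuation matters: a single missing comma is enough for a parser to reject the whole file.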
3.XML :-
*Extensible Markup Language.
*It is a human-readable data format that was popular in the 1990s and 2000s.
*It serves a similar purpose to JSON but is not identical: it describes data with markup tags instead of key-value pairs.
*XML uses tags enclosed in angle brackets (<...>) to define the elements and attributes (comparable to columns).
Example:-
<customers>
  <customer name="Joe" last_name="Yash">
    <customer_details>
      <contact type="Home" number="784651" />
      <contact type="Email" number="926312" />
    </customer_details>
  </customer>
</customers>
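
A minimal sketch of reading elements and attributes with Python's standard `xml.etree.ElementTree` module. The document below is a well-formed variant of the customer example; the names `last_name`, `customer_details`, and `contact` are assumptions chosen so that the XML parses.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the customer example; names are illustrative.
xml_text = """
<customers>
  <customer name="Joe" last_name="Yash">
    <customer_details>
      <contact type="Home" number="784651" />
      <contact type="Email" number="926312" />
    </customer_details>
  </customer>
</customers>
"""

root = ET.fromstring(xml_text.strip())  # root element: <customers>
customer = root.find("customer")        # first <customer> child

# Attributes are read with .get(); nested elements are walked with .iter().
full_name = f'{customer.get("name")} {customer.get("last_name")}'
contacts = {c.get("type"): c.get("number") for c in customer.iter("contact")}

print(full_name, contacts)
```

Names with spaces, or tags that are never closed, would make `ET.fromstring` raise a `ParseError`, so every opening tag needs a matching close.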
4.BLOB :-
*Binary Large Object (BLOB).
*Blobs are typically images, audio, or other multimedia objects, stored in binary form as unstructured data.
5.AVRO :-
*It is a row-based format.
*An Avro file carries a header that describes the structure (schema) of the records it contains. The header is stored as JSON and the data is stored as binary information.
*The data is easy to compress, which minimizes storage.
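
The idea of a JSON description followed by binary row data can be sketched with the standard library alone. This is not the real Avro encoding, which a dedicated Avro library would handle; the byte layout, the field names, and the helpers `write_rows` and `read_rows` are assumptions made purely for illustration.

```python
import json
import struct

# Hypothetical schema: describes the fields of every row in the file.
schema = {"fields": [{"name": "user_id", "type": "long"},
                     {"name": "score", "type": "double"}]}
STRUCT_CODES = {"long": "q", "double": "d"}  # 8-byte int, 8-byte float


def row_format(schema):
    """Build the struct format string for one row from the schema."""
    return "<" + "".join(STRUCT_CODES[f["type"]] for f in schema["fields"])


def write_rows(schema, rows):
    """Return bytes: length-prefixed JSON header, then rows packed one by one."""
    header = json.dumps(schema).encode("utf-8")
    names = [f["name"] for f in schema["fields"]]
    fmt = row_format(schema)
    body = b"".join(struct.pack(fmt, *(row[n] for n in names)) for row in rows)
    return struct.pack("<I", len(header)) + header + body


def read_rows(blob):
    """Recover the rows using only the schema stored in the header."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    stored_schema = json.loads(blob[4:4 + header_len].decode("utf-8"))
    names = [f["name"] for f in stored_schema["fields"]]
    body = blob[4 + header_len:]
    return [dict(zip(names, values))
            for values in struct.iter_unpack(row_format(stored_schema), body)]


rows = [{"user_id": 1, "score": 9.5}, {"user_id": 2, "score": 7.25}]
blob = write_rows(schema, rows)
decoded = read_rows(blob)
print(len(blob), decoded)
```

Because the reader takes the schema from the header, it needs no outside knowledge of the fields, and the rows themselves stay compact because field names are not repeated in the binary part.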
Example:
{
  "type": "record",
  "name": "thecodebuzz_schema",
  "namespace": "thecodebuzz.avro",
  "fields": [
    { "name": "username", "type": "string", "doc": "Name of the user account on Thecodebuzz.com" },
    { "name": "email", "type": "string", "doc": "The email of the user logging message on the blog" },
    { "name": "timestamp", "type": "long", "doc": "time in seconds" }
  ],
  "doc": "A basic schema for storing the code buzz blogs messages"
}
6.ORC:-
*Optimized Row Columnar (ORC) format.
*It organizes data into columns rather than rows.
*An ORC file contains stripes of data.
*Each stripe holds a group of rows, with the values of each column stored together inside the stripe.
*Because each column is stored separately, ORC enables efficient compression and selective reading of data.
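
The difference between a row layout and a columnar layout can be sketched in plain Python. The sample records are made up, and run-length encoding is used here only as a stand-in for the kind of compression that a columnar layout makes easy; it is not a description of ORC's actual encodings.

```python
# Hypothetical row-oriented records.
rows = [
    {"id": 1, "country": "IN", "amount": 120},
    {"id": 2, "country": "IN", "amount": 80},
    {"id": 3, "country": "US", "amount": 200},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


columns = to_columns(rows)

# Selective reading: summing one column touches only that column's values.
total_amount = sum(columns["amount"])


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


# Values of one column sit next to each other, so repeats compress well.
encoded_country = run_length_encode(columns["country"])
print(total_amount, encoded_country)
```

In the row layout the two "IN" values are separated by other fields, so the same trick would find no runs to collapse.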
7.PARQUET:-
*Columnar data format.
*A Parquet file contains row groups.
*Data for each column is stored together within the same row group.
*Each row group contains one or more chunks of data.
*A Parquet file includes metadata that describes the set of rows found in each chunk.
*It is a binary format and is not human-readable.
*It is one of the most widely used file formats for analytics workloads, mainly because of its efficient compression and fast column-oriented reads.
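
How row groups, column chunks, and metadata work together can be sketched in plain Python. This is a simplified model, not the Parquet file format itself; the helpers `build_row_groups` and `read_column_where`, the group size, and the min/max statistics are assumptions for illustration.

```python
def build_row_groups(rows, group_size):
    """Split rows into groups; store each group's columns and min/max stats."""
    groups = []
    for start in range(0, len(rows), group_size):
        batch = rows[start:start + group_size]
        chunks = {name: [r[name] for r in batch] for name in batch[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in chunks.items()}
        groups.append({"chunks": chunks, "stats": stats})
    return groups


def read_column_where(groups, column, filter_column, lower_bound):
    """Return `column` values for rows where filter_column >= lower_bound.

    Also returns how many row groups had to be scanned.
    """
    result, scanned = [], 0
    for group in groups:
        if group["stats"][filter_column][1] < lower_bound:
            continue  # metadata shows no row in this group can match
        scanned += 1
        pairs = zip(group["chunks"][column], group["chunks"][filter_column])
        result.extend(value for value, key in pairs if key >= lower_bound)
    return result, scanned


# Hypothetical data: ids 1..8 with amounts 10..80.
rows = [{"id": i, "amount": i * 10} for i in range(1, 9)]
groups = build_row_groups(rows, group_size=4)

ids, scanned = read_column_where(groups, "id", "amount", 60)
print(ids, scanned)
```

The first row group is skipped using its metadata alone, which is the main reason a reader can answer selective queries without scanning the whole file.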