Commit df316c2

Merge pull request #4963 from Ckarst/patch-39
Update create-external-file-format-transact-sql.md
2 parents 4d9726e + bdab78b

1 file changed

Lines changed: 53 additions & 36 deletions

File tree

docs/t-sql/statements/create-external-file-format-transact-sql.md

@@ -1,7 +1,7 @@
 ---
 title: "CREATE EXTERNAL FILE FORMAT (Transact-SQL) | Microsoft Docs"
 ms.custom: ""
-ms.date: "12/08/2017"
+ms.date: "2/20/2018"
 ms.prod: "sql-non-specified"
 ms.prod_service: "sql-data-warehouse, pdw, sql-database"
 ms.service: ""
@@ -31,19 +31,19 @@ ms.workload: "On Demand"
 # CREATE EXTERNAL FILE FORMAT (Transact-SQL)
 [!INCLUDE[tsql-appliesto-ss2016-xxxx-asdw-pdw-md](../../includes/tsql-appliesto-ss2016-xxxx-asdw-pdw-md.md)]
 
-Creates a PolyBase external file format definition for external data stored in Hadoop, Azure blob storage, or Azure Data Lake Store. Creating an external file format is a prerequisite for creating a PolyBase external table. By creating an external file format, you specify the actual layout of the data referenced by an external table.
+Creates an External File Format object defining external data stored in Hadoop, Azure Blob Storage, or Azure Data Lake Store. Creating an external file format is a prerequisite for creating an External Table. By creating an External File Format, you specify the actual layout of the data referenced by an external table.
 
 PolyBase supports the following file formats:
 
-- Delimited text
+- Delimited Text
 
 - Hive RCFile
 
 - Hive ORC
 
 - Parquet
 
-To create an external table, see [CREATE EXTERNAL TABLE (Transact-SQL)](../../t-sql/statements/create-external-table-transact-sql.md).
+To create an External Table, see [CREATE EXTERNAL TABLE (Transact-SQL)](../../t-sql/statements/create-external-table-transact-sql.md).
 
 ![Topic link icon](../../database-engine/configure-windows/media/topic-link.gif "Topic link icon") [Transact-SQL Syntax Conventions](../../t-sql/language-elements/transact-sql-syntax-conventions-transact-sql.md)
 
@@ -92,7 +92,8 @@ WITH (
 <format_options> ::=
 {
     FIELD_TERMINATOR = field_terminator
-    | STRING_DELIMITER = string_delimiter
+    | STRING_DELIMITER = string_delimiter
+    | FIRST_ROW = first_row_int -- Applies to: Azure SQL Data Warehouse only
     | DATE_FORMAT = datetime_format
     | USE_TYPE_DEFAULT = { TRUE | FALSE }
     | Encoding = {'UTF8' | 'UTF16'}
@@ -107,7 +108,7 @@ WITH (
 Specifies the format of the external data.
 
 - PARQUET
-Specifies a Parquet format.
+Specifies a Parquet format.
 
 - ORC
 Specifies an Optimized Row Columnar (ORC) format. This option requires Hive version 0.11 or higher on the external Hadoop cluster. In Hadoop, the ORC file format offers better compression and performance than the RCFILE file format.
@@ -124,8 +125,8 @@ WITH (
 - DELIMITEDTEXT
 Specifies a text format with column delimiters, also called field terminators.
 
-FIELD_TERMINATOR = *field_terminator*
-Applies only to delimited text files. The field terminator specifies one or more characters that mark the end of each field (column) in the text-delimited file. The default is the pipe character ꞌ|ꞌ. For guaranteed support, we recommend using one or more ascii characters.
+FIELD_TERMINATOR = *field_terminator*
+Applies only to delimited text files. The field terminator specifies one or more characters that mark the end of each field (column) in the text-delimited file. The default is the pipe character ꞌ|ꞌ. For guaranteed support, we recommend using one or more ascii characters.
 
 
 Examples:
@@ -139,7 +140,7 @@ WITH (
 - FIELD_TERMINATOR = '~|~'
 
 STRING_DELIMITER = *string_delimiter*
-Specifies the field terminator for data of type string in the text-delimited file. The string delimiter is one or more characters in length and is enclosed with single quotes. The default is the empty string "". For guaranteed support, we recommend using one or more ascii characters.
+Specifies the field terminator for data of type string in the text-delimited file. The string delimiter is one or more characters in length and is enclosed with single quotes. The default is the empty string "". For guaranteed support, we recommend using one or more ascii characters.
 
 
 Examples:
@@ -153,13 +154,16 @@ WITH (
 - STRING_DELIMITER = ꞌ,ꞌ
 
 - STRING_DELIMITER = '0x7E0x7E' -- Two tildes (for example, ~~)
-
-DATE\_FORMAT = *datetime_format*
-Specifies a custom format for all date and time data that might appear in a delimited text file. If the source file uses default datetime formats, this option is not necessary. Only one custom datetime format is allowed per file. You cannot specify multiple custom datetime formats per file. However, you can use multiple datetime formats, if each one is the default format for its respective data type in the external table definition.
+
+FIRST_ROW = *first_row_int*
+Specifies the row number that is read first in all files during a PolyBase load. This parameter can take values 1-15. If the value is set to two, the first row in every file (the header row) is skipped when the data is loaded. Rows are skipped based on the existence of row terminators (\r\n, \r, \n). When this option is used for export, rows are added to the data to make sure the file can be read with no data loss. If the value is set to >2, the first row exported is the column names of the external table.
+
+DATE\_FORMAT = *datetime_format*
+Specifies a custom format for all date and time data that might appear in a delimited text file. If the source file uses default datetime formats, this option isn't necessary. Only one custom datetime format is allowed per file. You can't specify more than one custom datetime format per file. However, you can use more than one datetime format if each one is the default format for its respective data type in the external table definition.
 
-PolyBase only uses the custom date format for importing the data. It does not use the custom format for writing data to an external file.
+PolyBase only uses the custom date format for importing the data. It doesn't use the custom format for writing data to an external file.
 
-When DATE_FORMAT is not specified or is the empty string, PolyBase uses the following default formats:
+When DATE_FORMAT isn't specified or is the empty string, PolyBase uses the following default formats:
 
 - DateTime: 'yyyy-MM-dd HH:mm:ss'
 
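None of the worked examples later in this article uses DATE\_FORMAT. As an illustrative sketch (the format name and the datetime pattern here are hypothetical, not part of this commit), a single custom datetime format would be declared like this:

```
-- Hypothetical sketch: declare one custom datetime format for the file.
-- Only one custom format is allowed per file (see DATE_FORMAT above).
CREATE EXTERNAL FILE FORMAT dateFormatSketch
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff'
    )
);
```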
@@ -173,15 +177,15 @@ PolyBase only uses the custom date format for importing the data. It does not us
 
 - Time: 'HH:mm:ss'
 
-**Example date formats** are in the following table.
+**Example date formats** are in the following table:
 
-Notes about the table:
+Notes about the table:
 
-- Year, month, and day can have a variety of formats and orders. The table shows only the **ymd** format. Month can have 1 or 2 digits, or 3 characters. Day can have 1 or 2 digits. Year can have 2 or 4 digits.
+- Year, month, and day can have a variety of formats and orders. The table shows only the **ymd** format. Month can have one or two digits, or three characters. Day can have one or two digits. Year can have two or four digits.
 
-- Milliseconds (fffffff) is not required.
+- Milliseconds (fffffff) are not required.
 
-- Am, pm (tt) is not required. The default is AM.
+- Am, pm (tt) isn't required. The default is AM.
 
 |Date Type|Example|Description|
 |---------------|-------------|-----------------|
@@ -216,22 +220,22 @@ PolyBase only uses the custom date format for importing the data. It does not us
 
 Details:
 
-- To separate month, day and year values, you can use '–', ' / ', or ' . '. For simplicity, the table uses only the ' – ' separator.
+- To separate month, day and year values, you can use '–', '/', or '.'. For simplicity, the table uses only the ' – ' separator.
 
-- To specify the month as text, use three or more characters. Months with 1 or 2 characters are interpreted as a number.
+- To specify the month as text, use three or more characters. Months with one or two characters are interpreted as a number.
 
-- To separate time values, use the ' : ' symbol.
+- To separate time values, use the ':' symbol.
 
 - Letters enclosed in square brackets are optional.
 
 - The letters 'tt' designate [AM|PM|am|pm]. AM is the default. When 'tt' is specified, the hour value (hh) must be in the range of 0 to 12.
 
 - The letters 'zzz' designate the time zone offset for the system's current time zone in the format {+|-}HH:ss.
 
-USE_TYPE_DEFAULT = { TRUE | **FALSE** }
+USE_TYPE_DEFAULT = { TRUE | **FALSE** }
 Specifies how to handle missing values in delimited text files when PolyBase retrieves data from the text file.
 
-TRUE
+TRUE
 When retrieving data from the text file, store each missing value by using the default value for the data type of the corresponding column in the external table definition. For example, replace a missing value with:
 
 - 0 if the column is defined as a numeric column.
@@ -240,15 +244,15 @@ PolyBase only uses the custom date format for importing the data. It does not us
 
 - 1900-01-01 if the column is a date column.
 
-FALSE
+FALSE
 Store all missing values as NULL. Any NULL values that are stored by using the word NULL in the delimited text file are imported as the string 'NULL'.
 
-Encoding = {'UTF8' | 'UTF16'}
-In Azure SQL Data Warehouse, PolyBase can read UTF8 and UTF16-LE encoded delimited text files. In SQL Server and PDW, PolyBase does not support reading UTF16 encoded files.
+Encoding = {'UTF8' | 'UTF16'}
+In Azure SQL Data Warehouse, PolyBase can read UTF8 and UTF16-LE encoded delimited text files. In SQL Server and PDW, PolyBase doesn't support reading UTF16 encoded files.
 
-DATA_COMPRESSION = *data_compression_method*
-Specifies the data compression method for the external data. When DATA_COMPRESSION is not specified, the default is uncompressed data.
-In order to work properly, Gzip compressed files must have the ".gz" file extension.
+DATA_COMPRESSION = *data_compression_method*
+Specifies the data compression method for the external data. When DATA_COMPRESSION isn't specified, the default is uncompressed data.
+To work properly, Gzip compressed files must have the ".gz" file extension.
 
 The DELIMITEDTEXT format type supports these compression methods:
 
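The USE_TYPE_DEFAULT and Encoding options above can be combined in one definition. A hypothetical sketch (the format name and option values are illustrative; per the text above, Encoding = 'UTF16' is read only in Azure SQL Data Warehouse):

```
-- Hypothetical sketch: missing fields fall back to the column type
-- defaults (0, empty string, 1900-01-01), and input is UTF-16 encoded.
CREATE EXTERNAL FILE FORMAT utf16Sketch
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        USE_TYPE_DEFAULT = TRUE,
        Encoding = 'UTF16'
    )
);
```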
@@ -266,7 +270,7 @@ PolyBase only uses the custom date format for importing the data. It does not us
 
 - DATA COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
 
-The PARQUET file format type supports the folliwing compression methods:
+The PARQUET file format type supports the following compression methods:
 
 - DATA COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
 
@@ -300,12 +304,12 @@ PolyBase only uses the custom date format for importing the data. It does not us
 ## Performance
 Using compressed files always comes with the tradeoff of transferring less data between the external data source and SQL Server while increasing the CPU usage to compress and decompress the data.
 
-Gzip compressed text files are not splittable. To improve performance for Gzip compressed text files, we recommend generating multiple files that are all stored in the same directory within the external data source. This allows PolyBase to read and decompress the data faster by using multiple reader and decompression processes. The ideal number of compressed files is the maximum number of data reader processes per compute node. In [!INCLUDE[ssNoVersion](../../includes/ssnoversion-md.md)] and [!INCLUDE[ssPDW](../../includes/sspdw-md.md)], the maximum number of data reader processes is 8 per node in the current release. In [!INCLUDE[ssSDW](../../includes/sssdw-md.md)], the maximum number of data reader processes per node varies by SLO. See [Azure SQL Data Warehouse loading patterns and strategies](https://blogs.msdn.microsoft.com/sqlcat/2016/02/06/azure-sql-data-warehouse-loading-patterns-and-strategies/) for details.
+Gzip compressed text files are not splittable. To improve performance for Gzip compressed text files, we recommend generating multiple files that are all stored in the same directory within the external data source. This file structure allows PolyBase to read and decompress the data faster by using multiple reader and decompression processes. The ideal number of compressed files is the maximum number of data reader processes per compute node. In [!INCLUDE[ssNoVersion](../../includes/ssnoversion-md.md)] and [!INCLUDE[ssPDW](../../includes/sspdw-md.md)], the maximum number of data reader processes is 8 per node in the current release. In [!INCLUDE[ssSDW](../../includes/sssdw-md.md)], the maximum number of data reader processes per node varies by SLO. See [Azure SQL Data Warehouse loading patterns and strategies](https://blogs.msdn.microsoft.com/sqlcat/2016/02/06/azure-sql-data-warehouse-loading-patterns-and-strategies/) for details.
 
 ## Examples
 
 ### A. Create a DELIMITEDTEXT external file format
-This example creates an external file format named *textdelimited1* for a text-delimited file. The options listed for FORMAT\_OPTIONS specify that the fields in the file should be separated using a pipe character '|'. The text file is also compressed with the Gzip codec. If DATA\_COMPRESSION is not specified, the text file is uncompressed.
+This example creates an external file format named *textdelimited1* for a text-delimited file. The options listed for FORMAT\_OPTIONS specify that the fields in the file should be separated using a pipe character '|'. The text file is also compressed with the Gzip codec. If DATA\_COMPRESSION isn't specified, the text file is uncompressed.
 
 For a delimited text file, the data compression method can either be the default Codec, 'org.apache.hadoop.io.compress.DefaultCodec', or the Gzip Codec, 'org.apache.hadoop.io.compress.GzipCodec'.
 
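Example A uses the Gzip codec; for contrast, a sketch using the default codec named in the paragraph above (the format name is hypothetical, not from this commit):

```
-- Hypothetical sketch: delimited text compressed with the default codec
-- rather than Gzip. Omit DATA_COMPRESSION entirely for uncompressed files.
CREATE EXTERNAL FILE FORMAT defaultCodecSketch
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|'),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec'
);
```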
@@ -321,7 +325,7 @@ WITH (
 ```
 
 ### B. Create an RCFile external file format
-This example creates an external file format for a RCFile that uses the serialization/deserialization method org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe. It also specifies to use the Default Codec for the data compression method. If DATA_COMPRESSION is not specified, the default is no compression.
+This example creates an external file format for an RCFile that uses the serialization/deserialization method org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe. It also specifies to use the Default Codec for the data compression method. If DATA_COMPRESSION isn't specified, the default is no compression.
 
 ```
 CREATE EXTERNAL FILE FORMAT rcfile1
@@ -333,7 +337,7 @@ WITH (
 ```
 
 ### C. Create an ORC external file format
-This example creates an external file format for an ORC file that compresses the data with the org.apache.io.compress.SnappyCodec data compression method. If DATA_COMPRESSION is not specified, the default is no compression.
+This example creates an external file format for an ORC file that compresses the data with the org.apache.io.compress.SnappyCodec data compression method. If DATA_COMPRESSION isn't specified, the default is no compression.
 
 ```
 CREATE EXTERNAL FILE FORMAT orcfile1
@@ -344,7 +348,7 @@ WITH (
 ```
 
 ### D. Create a PARQUET external file format
-This example creates an external file format for a Parquet file that compresses the data with the org.apache.io.compress.SnappyCodec data compression method. If DATA_COMPRESSION is not specified, the default is no compression.
+This example creates an external file format for a Parquet file that compresses the data with the org.apache.io.compress.SnappyCodec data compression method. If DATA_COMPRESSION isn't specified, the default is no compression.
 
 ```
 CREATE EXTERNAL FILE FORMAT parquetfile1
@@ -353,7 +357,20 @@ WITH (
     DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
 );
 ```
+### E. Create a delimited text file format, skipping a header row (Azure SQL DW only)
+This example creates an external file format for a CSV file with a single header row.
 
+```
+CREATE EXTERNAL FILE FORMAT skipHeader_CSV
+WITH (FORMAT_TYPE = DELIMITEDTEXT,
+      FORMAT_OPTIONS(
+          FIELD_TERMINATOR = ',',
+          STRING_DELIMITER = '"',
+          FIRST_ROW = 2,
+          USE_TYPE_DEFAULT = True)
+)
+```
+
 ## See Also
 [CREATE EXTERNAL DATA SOURCE &#40;Transact-SQL&#41;](../../t-sql/statements/create-external-data-source-transact-sql.md)
 [CREATE EXTERNAL TABLE &#40;Transact-SQL&#41;](../../t-sql/statements/create-external-table-transact-sql.md)
