Post by alyc1208
It does not accept it in the FTP;
it only wants a single-byte CCSID !!!
How should I go about importing a UTF-8 file into the IFS?
The message from alyc on 28-Mar-2012, 08:40, was not available on the
NewsServer that I use... so I am replying to this message instead.
I am not sure why the CCSID parameter of FTP limits support to
single-byte CCSID values. Regardless, what is done instead to have the
FTP create the target file with the correct CCSID is to use the
variation on the TYPE Subcommand supported by the IBM i FTP server and
client.
If the FTP is from the ASCII system as client to the IBM i as FTP
server, then issue the FTP subcommand request:
quote type c 1208
If the FTP is from IBM i as FTP client to the ASCII system as FTP
server, then issue the FTP subcommand request:
type c 1208
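For a scripted transfer from the ASCII side, the same subcommand can be
sent as a raw command. What follows is only a minimal sketch using
Python's ftplib, assuming the IBM i server honors TYPE C 1208 as
described above; the function name upload_with_ccsid is hypothetical,
and I have not run this against a real IBM i server.

```python
from ftplib import FTP  # used only by the caller to open the session


def upload_with_ccsid(ftp, data, remote_name, ccsid=1208):
    """Upload bytes after requesting TYPE C <ccsid> from the server.

    `ftp` is a logged-in ftplib.FTP connection (or any object with the
    same sendcmd/transfercmd/voidresp methods).
    """
    # Raw command; the same thing "quote type c 1208" sends from a client.
    ftp.sendcmd("TYPE C %d" % ccsid)
    # storbinary()/storlines() would first send TYPE I / TYPE A and so
    # override the setting; open the data connection directly instead.
    conn = ftp.transfercmd("STOR " + remote_name)
    try:
        conn.sendall(data)
    finally:
        conn.close()
    # Read the final completion reply for the transfer.
    return ftp.voidresp()
```

A caller would open the session with FTP(host) and login(user, password),
then pass the connection and the file's bytes to the function.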
Regardless, I do not think the issue is with the initial import into
the IFS, when defaulting to "type c 819". I think in that scenario, the
FTP properly negotiates the ASCII transfer, and aside from possibly
recognizing the control characters as delimiter for end-of-record, I
suspect all other data is unchanged; i.e. aside from record delimiters,
I suspect the record data is treated as "binary stream data".?
I just tested with a file saved as UTF-8 on a Mac, containing several
characters that are variations on the quotation mark [and similarly on
the apostrophe]; these were choices that the editor offered, along with
the UTF-8 hex code points that would be saved. The "normal" quotation
mark, 0x22 [GCGID: SP040000 "Quotation Marks"], and the 'normal'
apostrophe, 0x27 [GCGID: SP050000 "Apostrophe"], are both single-byte
characters and transport as expected. The other characters were all
multi-byte characters, 3 bytes each in UTF-8, and those transport as
expected too.
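The single-byte versus 3-byte distinction is easy to check off the IBM i.
A small sketch in Python; the SAMPLES list is my own pick of a few of
the variants listed at the end of this post, not necessarily the exact
characters in the test file.

```python
import unicodedata

# The plain ASCII marks plus a few typographic variants.
SAMPLES = ['"', "'", "\u201c", "\u201d", "\u2018", "\u2019"]


def utf8_hex(ch):
    """Return the UTF-8 encoding of one character as spaced hex bytes."""
    return " ".join("%02X" % b for b in ch.encode("utf-8"))


for ch in SAMPLES:
    # Code point, UTF-8 bytes, and the official Unicode character name.
    print("U+%04X  %-8s  %s" % (ord(ch), utf8_hex(ch), unicodedata.name(ch)))
```

The plain marks come out as one byte each (22 and 27); each typographic
variant comes out as three bytes starting with E2 80.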
I transported the file in binary, TYPE Image, into two separate
files: utf8.bin and utf8.txt
I then tagged the utf8.txt file with CCSID(1208) using the request to
CHGATR ATR(*CCSID) VALUE(1208). The utf8.bin file was unchanged, having
been left with the CCSID(819) as created by the FTP.
When I viewed the file utf8.bin using DSPF F10=Hex, all three of the
multi-byte code points were shown in the hex, and each byte was
presented as the "#" character; the "replace unprintable character" for
the Display File (DSPF) command\feature similar to the RPLUNPRT()
feature of printer files.
When I viewed the file utf8.txt using DSPF F10=Hex, all three of the
multi-byte code points were shown in the hex, but all three were
presented as just the *one* "#" character; again, the "replace
unprintable character" for the Display File (DSPF) command\feature
similar to the RPLUNPRT() feature of printer files.
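One way to picture why the same bytes show as three "#" under
CCSID(819) but one "#" under CCSID(1208): the tag decides how many
characters the bytes represent. A rough sketch, assuming Python's
latin-1 codec as a stand-in for CCSID 819 and utf-8 for 1208; the
display() helper is hypothetical and only mimics the "replace
unprintable" idea, not how DSPF is actually implemented.

```python
def display(text):
    """Mimic a 'replace unprintable' display: non-ASCII becomes '#'."""
    return "".join(c if " " <= c <= "~" else "#" for c in text)


# UTF-8 bytes for the word "empty" wrapped in U+201C and U+201D.
raw = b"\xe2\x80\x9cempty\xe2\x80\x9d"

as_819 = raw.decode("latin-1")   # every byte becomes its own character
as_1208 = raw.decode("utf-8")    # each 3-byte sequence is one character

print(display(as_819))   # ###empty###
print(display(as_1208))  # #empty#
```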
Having copied either of those stream files into a database file with
CVTDTA(*AUTO), telling the CPYFRMSTMF utility that the text stream data
was CCSID(1208), caused the multibyte characters to be replaced with the
EBCDIC "substitution character", hex code point 0x3F. Certainly not
much of a "best-fit" conversion\translation :-( where one might hope the
characters might instead translate to the /most similar/ EBCDIC
character 0x7F or 0x7D. The scripted requests follow:
≥ crtsrcpf qtemp/dta rcdlen(412) ccsid(1141)
≥ cpyfrmstmf utf8.txt '/qsys.lib/qtemp.lib/dta.file/utf8.mbr'
cvtdta(*auto) mbropt(*replace) stmfcodpag(1208) dbfccsid(1141)
endlinfmt(*crlf) /* or: dbfccsid(*file) stmfcodpag(*stmf) */
≥ dsppfm qtemp/dta mbr(utf8) /* use F10=Hex to see the x'3F' */
I am not sure why the Copy From Stream File utility does not issue a
warning message to indicate that "substitution characters" were used to
effect the conversion.? If the database had done the conversion, then I
believe such a warning is typically provided to the caller, and
generally [as I recall] a message is logged.
I suppose CPYFRMSTMF could be replaced with a custom utility, so that
perhaps a better translation could be used [e.g. choosing a /best fit/
algorithm from 1208 to 1144 with iconv might be an option and
effective?], or so that, even if the effect is the same /substitution
character/, a warning is at least logged for the data loss on those
multi-byte characters.
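To make the idea concrete, here is a sketch of what such a custom
conversion might do. This is not iconv and not anything CPYFRMSTMF
offers; it is a hand-rolled table in Python. The BEST_FIT mapping and
the to_ebcdic name are hypothetical, and Python's cp273 codec is only a
stand-in for CCSID 1141, since as far as I know Python ships no codec
for 1141 or 1144.

```python
import logging

log = logging.getLogger("bestfit")

# Hypothetical best-fit table: typographic quotes to their plain forms.
BEST_FIT = {
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u201f": '"',
    "\u2018": "'", "\u2019": "'", "\u201a": "'", "\u201b": "'",
}

SUB = b"\x3f"  # EBCDIC substitution character


def to_ebcdic(text, codec="cp273"):
    """Encode text to EBCDIC, best-fitting quotes and counting losses."""
    out = bytearray()
    substituted = 0
    for ch in text:
        ch = BEST_FIT.get(ch, ch)      # apply best fit where one exists
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            out += SUB                 # no mapping: fall back to x'3F'
            substituted += 1
    if substituted:
        # The warning that the copy utility does not seem to give.
        log.warning("%d character(s) replaced with x'3F'", substituted)
    return bytes(out), substituted
```

With this table the double quotation variants land on x'7F' and the
single ones on x'7D'; anything else without a mapping still becomes
x'3F', but the caller gets a count and a logged warning.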
I mention a custom replacement, because I presume that the Conversion
table (TBL) specification available from the CPYFRMSTMF utility does not
provide any support for multi-byte characters; I could easily be
wrong... I did not look further to see if there is some support.
What was odd for me, or at least something I do not recall as the
effect, is that my emulator [default\built-in telnet on the Mac] seems
to simply "lose" the x'3F' character in presentation; i.e. there is no
glyph, and the appearance of the data on the screen is shifted as if the
character was not even in the 5250 data stream. Thus if I issue a
request to RUNQRY *N ((QTEMP/DTA UTF8)) RCDSLT(*YES), my line of data
appears as the non-delimited 'Subject: empty string', even though the
DSPPFM F10=Hex shows the 0x3F "E2A48291 8583A37A 403F8594 97A3A83F"; and
of course, a search on SRCDTA LIKE '% empty%' will exclude that row, and
a search on SRCDTA LIKE '%_empty%' finds the row. On my PComm 5250
emulator I seem to recall the x'3F' appeared as a visible glyph.
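The search behavior can be reproduced away from the 5250 session. A
small sketch that decodes the 16 bytes quoted above; Python's cp273
codec is again an assumed stand-in for CCSID 1141, and the two tests
only approximate the SQL LIKE patterns.

```python
import re

# Hex shown by DSPPFM F10=Hex in the post, spaces removed.
row = bytes.fromhex("E2A482918583A37A403F859497A3A83F").decode("cp273")

# x'3F' decodes to U+001A (SUB), a control character with no glyph.
print(repr(row))

# LIKE '% empty%': requires a blank immediately before "empty".
like_blank = " empty" in row
# LIKE '%_empty%': "_" matches any single character, including SUB.
like_any = re.search(r".empty", row, re.DOTALL) is not None

print(like_blank, like_any)  # False True
```

The SUB character sits between the blank and "empty", so the first
pattern misses the row while the second finds it.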
For reference, some links which describe the characters which were
tested and problematic in my scenario. Very likely these variations of
the "quotation mark" character are the same issue you are experiencing.
http://www.tachyonsoft.com/uc0020.htm
http://unicode-search.net/unicode-namesearch.pl?term=QUOTATION
Character name                   Unicode   UTF-8
LEFT DOUBLE QUOTATION MARK       201C      E2 80 9C
RIGHT DOUBLE QUOTATION MARK      201D      E2 80 9D
DOUBLE HIGH-REVERSED-9 QU...     201F      E2 80 9F
DOUBLE LOW-9 QUOTATION MARK      201E      E2 80 9E
SINGLE HIGH-REVERSED-9 QU...     201B      E2 80 9B
SINGLE LOW-9 QUOTATION MARK      201A      E2 80 9A
LEFT SINGLE QUOTATION MARK       2018      E2 80 98
RIGHT SINGLE QUOTATION MARK      2019      E2 80 99
Regards, Chuck