Discussion:
ftp + stmf + ccsid
alyc
2012-03-23 17:30:40 UTC
The usual code page problem (naturally it's Friday, but I'm getting ready for next Monday):
- via FTP, from the AS/400, I fetch a txt file from a web server
- the file in the IFS has code page 819
- I create a file in QTEMP with crtf qtemp/filetemp rcdlen(1000)
- I copy from the IFS to filetemp with
CPYFRMSTMF FROMSTMF('cartella/filetxt') TOMBR('QSYS.LIB/QTEMP.LIB/filetemp.FILE/filetemp.mbr') MBROPT(*REPLACE) CVTDTA(*AUTO) STMFCODPAG(*STMF) DBFCCSID(*FILE)

90% of the time everything works, but in some cases certain characters, for example quotation marks ( " ), are converted incorrectly,
so I end up storing garbled data in the database, which I sometimes cannot even display in 5250.

What do you think I'm doing wrong?
Why doesn't the system perform the conversion as it should?

Many thanks and have a good weekend!
Danilo Cussini
2012-03-26 08:14:07 UTC
Post by alyc
The usual code page problem (naturally it's Friday, but I'm getting ready for next Monday):
- via FTP, from the AS/400, I fetch a txt file from a web server
- the file in the IFS has code page 819
- I create a file in QTEMP with crtf qtemp/filetemp rcdlen(1000)
- I copy from the IFS to filetemp with
CPYFRMSTMF FROMSTMF('cartella/filetxt') TOMBR('QSYS.LIB/QTEMP.LIB/filetemp.FILE/filetemp.mbr') MBROPT(*REPLACE) CVTDTA(*AUTO) STMFCODPAG(*STMF) DBFCCSID(*FILE)
90% of the time everything works, but in some cases certain characters, for example quotation marks ( " ), are converted incorrectly,
so I end up storing garbled data in the database, which I sometimes cannot even display in 5250.
What do you think I'm doing wrong?
Why doesn't the system perform the conversion as it should?
Many thanks and have a good weekend!
https://groups.google.com/forum/#!searchin/it.comp.as400/ccsid/it.comp.as400/qBOd9j78cq0/r3ZseznlS7EJ
CRPence
2012-03-26 19:02:45 UTC
Post by alyc
The usual code page problem (naturally it's Friday, but I'm getting ready for next Monday):
- via FTP, from the AS/400, I fetch a txt file from a web server
- the file in the IFS has code page 819
CCSID 819? Best to verify that file, to ensure the data really does
match the CCSID with which the file is tagged.
ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP00819.pdf
Post by alyc
- I create a file in QTEMP with crtf qtemp/filetemp rcdlen(1000)
CRTPF or CRTSRCPF?
Post by alyc
- I copy from the IFS to filetemp with
CPYFRMSTMF FROMSTMF('cartella/filetxt')
TOMBR('/QSYS.LIB/QTEMP.LIB/filetemp.FILE/filetemp.mbr')
MBROPT(*REPLACE) CVTDTA(*AUTO) STMFCODPAG(*STMF) DBFCCSID(*FILE)
Presumably the prior request was a CRTPF. If so, then the DBF CCSID
is *HEX, 65535, which means "do not translate". However, attempting to
be /friendly/ the Copy From Stream File utility assumes that the user
wants the encoding changed from ASCII to EBCDIC [per CVTDTA(*AUTO)], so
some code point conversion is performed anyhow, even though there was no
specific CCSID named, into which to convert the data. According to the
help text for the use of DBFCCSID(*FILE), the effect for that request
[using a database file tagged with *HEX] will be, that "the default job
CCSID is used." That means the effects for the request can change with
the CCSID of the job. That may or may not be an issue in this scenario.
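To illustrate why a conversion that falls back on the job CCSID can give different results from one job to another, here is a minimal Python sketch. It is not CPYFRMSTMF; Python's built-in cp037 and cp500 codecs merely stand in for two EBCDIC job CCSIDs (an assumption made for illustration). It shows that some characters land on different EBCDIC code points depending on the target code page, while the plain quotation mark does not.

```python
# Sketch only: Python's cp037 and cp500 codecs stand in for two different
# EBCDIC job CCSIDs; the real conversion is done by the system, not Python.
ascii_bytes = b'say "hi"! [x]'          # sample data as stored under CCSID 819
text = ascii_bytes.decode("latin-1")    # CCSID 819 denotes ISO 8859-1

ebcdic_037 = text.encode("cp037")
ebcdic_500 = text.encode("cp500")

# Characters whose EBCDIC code point depends on the target code page.
variant = [ch for ch in sorted(set(text))
           if ch.encode("cp037") != ch.encode("cp500")]

print(ebcdic_037.hex())
print(ebcdic_500.hex())
print("variant characters:", variant)
```

The quotation mark is not among the variant characters, which fits the suspicion that the trouble lies in the source data rather than in the choice of EBCDIC target.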
Post by alyc
90% of the time everything works, but in some cases certain characters
are converted incorrectly, for example quotation marks ( " ),
The "Quotation Marks" special character named SP040000 is at the code
point 0x22 in the Code Page 819. Verify that the ASCII character is
represented by that code point. The DMP command was the method to get
the hex values from a STMF, but since some release [v5r3 for sure] the
F10=Hex within the DSPF command shows the ASCII code point values [in
some prior release(s), only the EBCDIC code points would be shown].
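If inspecting the stream file on the system is awkward, the same check can be made on any copy of the file. A minimal Python sketch, assuming only that the raw bytes are available; the helper name `suspicious_bytes` and the sample data are hypothetical. It lists every byte outside the 7-bit ASCII range together with its offset, which is where a wrongly tagged file would show up.

```python
def suspicious_bytes(data: bytes):
    """Return (offset, byte value) pairs for every byte >= 0x80."""
    return [(pos, b) for pos, b in enumerate(data) if b >= 0x80]

# Hypothetical sample: a plain quotation mark (0x22) followed by a
# word wrapped in the bytes 0x93 / 0x94.
sample = b'"plain" \x93curly\x94'

for pos, value in suspicious_bytes(sample):
    print(f"offset {pos}: 0x{value:02X}")
```

For a real file, the bytes would come from something like `open(path, "rb").read()`.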
Post by alyc
so I end up storing garbled data in the database, which I sometimes
cannot even display in 5250.
More important than the glyph, which might not be visible in a
particular font: what is the hex code point for the character? The
request to DSPPFM QTEMP/FILETEMP can be used with F10=Hex [and
F11=Over\Under].
Post by alyc
What do you think I'm doing wrong?
My guess is that the ASCII data from the web server was not all CCSID
819 data, even though the request to transport the data by FTP assumed
that was the scenario [FTP CCSID(*DFT) equates with CCSID(819)]. That
would mean the STMF in the IFS is improperly tagged.
Post by alyc
Why doesn't the system perform the conversion as it should?
I wonder if the data in ASCII are the characters SP220000 "Right
Double Quotes" and SP210000 "Left Double Quotes", such that the ASCII
transport and tagging should be with CCSID(1252) [STMFCODPAG(*PCASCII)]
instead of CCSID(819)?
ftp://ftp.software.ibm.com/software/globalization/gcoc/attachments/CP01252.pdf

Even so, I am not sure how much that would help. Perhaps conversion
mapping tables for the FTP [as I believe was alluded indirectly already
by Danilo, though I believe more appropriately the TBL() on CPYFRMSTMF;
i.e. the FTP was ASCII to ASCII?] might be required because...

In a quick test, the hex code points 0x93 and 0x94 from either ASCII
CCSID 819 or 1252 translated to the EBCDIC code points 0x33 and 0x34,
which are "control characters" below EBCDIC 0x40 [space\blank] and are
[mostly] non-displayable in the 5250 data stream. When those two control
characters were paired, with alphabetic data between them, the string
appeared in my emulator as inverse-red.
_Appendix G. Control Character Mappings_
http://www.ibm.com/software/globalization/cdra/appendix_g.html
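
The difference between the two ASCII taggings can be reproduced off the system. A small Python sketch using Python's own codecs: latin-1 for 819, cp1252 for 1252, and cp037 as an assumed stand-in for the EBCDIC job CCSID, which is not named in the test above. Python's tables are not the IBM conversion tables, so this only illustrates the idea: under 819 the bytes 0x93 and 0x94 are C1 control characters, which can only go to the EBCDIC control area below 0x40, whereas under 1252 they are the left and right double quotation marks, which this EBCDIC code page does not contain.

```python
raw = b"\x93empty\x94"

as_819 = raw.decode("latin-1")    # 0x93/0x94 -> U+0093/U+0094 (C1 controls)
as_1252 = raw.decode("cp1252")    # 0x93/0x94 -> U+201C/U+201D (curly quotes)

# Under 819 the controls still convert, but only to EBCDIC control positions.
ebcdic = as_819.encode("cp037")
print(ebcdic.hex())

# Under 1252 the curly quotes have no counterpart in this EBCDIC code page.
try:
    as_1252.encode("cp037")
    has_mapping = True
except UnicodeEncodeError:
    has_mapping = False
print("curly quotes representable in cp037:", has_mapping)
```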

FWiW I also recall problems with "smart quotes", which I recall were
from a multi-byte character CCSID, often utilized by MS document
utilities. Or maybe just those left\right quotes? Regardless, going
back to the original data and reviewing the code points, to verify what
the true ASCII data is, is IMO of primary importance to help resolve this.

Regards, Chuck
alyc
2012-03-27 07:30:17 UTC
Hi Chuck,
thanks for the reply.
I can't run any tests until this afternoon;
I'll reply later.
alyc
2012-03-27 14:13:46 UTC
In the meantime, here is an update on what (little) I have found out. (I wrote this in Italian, as I can't manage it in English!)

On Windows, the file (opened with a text editor) appears to be encoded in UTF-8.

So code page 819 is surely being assigned by FTP when it gets the file.

Now I would like to try the get with the FTP CCSID set to UTF-8, so that it does not alter the attributes of the txt file
(from earlier pointers by Stefano Tassi I know that UTF-8 corresponds to CCSID 1208, but I have not found an official document in IBM's Globalization area).
In any case I would run this test: if everything goes as I expect, I should end up with a file tagged CCSID 1208 in the IFS, and at that point the cpyfrmimpf command should do the conversion correctly (as far as possible; where it is not possible, will it insert blanks?)
alyc
2012-03-27 14:26:32 UTC
FTP won't accept 1208;
it only wants single-byte CCSIDs!!!

How should I go about importing a UTF-8 file into the IFS?
Danilo Cussini
2012-03-27 14:55:34 UTC
Post by alyc
FTP won't accept 1208;
it only wants single-byte CCSIDs!!!
How should I go about importing a UTF-8 file into the IFS?
Change it before the copy with the CHGATR command, or specify it on the CPYFRMSTMF command.
stefano[dot]tassi[at]dedanext[dot]it
2012-03-27 14:59:39 UTC
Post by alyc
FTP won't accept 1208;
it only wants single-byte CCSIDs!!!
How should I go about importing a UTF-8 file into the IFS?
It is very odd to have code page 1208 on a file on a client: usually
they are 819 files with a BOM that marks them as UTF-8.

What makes you think the CCSID of the file on the client is 1208?
--
--
http://www.linkedin.com/in/stefanotassi

Programming today is a race between software engineers striving to
build bigger and better idiotproof programs, and the Universe trying
to produce bigger and better idiots. So far the Universe is winning.
(Rick Cook)
alyc
2012-03-27 15:25:10 UTC
Post by stefano[dot]tassi[at]dedanext[dot]it
What makes you think the CCSID of the file on the client is 1208?
I don't understand much about this, but when I open it with PSPad it tells me the character encoding is UTF-8.

How else can I find out what this blessed file really is?
stefano[dot]tassi[at]dedanext[dot]it
2012-03-28 06:12:10 UTC
Post by alyc
Post by stefano[dot]tassi[at]dedanext[dot]it
What makes you think the CCSID of the file on the client is 1208?
I don't understand much about this, but when I open it with PSPad it tells me the character encoding is UTF-8.
How else can I find out what this blessed file really is?
OK, so you did not check the CCSID, which is almost certainly 819, but
the editor's verdict; the editor interpreted the file correctly, most
likely because the BOM is set for UTF-8.
It is still an 819 file with 3 leading bytes stating that the bytes
that follow contain UTF-8 encoded characters.
The mere presence of those three leading bytes is a problem, since an
import will not remove them, with obvious consequences.
I'll add: UTF-8 is not double-byte, it is multibyte, in the sense that,
depending on the character to be represented, it takes one, two, three,
etc. bytes.
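
A minimal Python sketch of the two points above: the three-byte marker (the UTF-8 BOM, bytes EF BB BF) and the variable length of UTF-8 characters. The helper name `strip_utf8_bom` is hypothetical; `codecs.BOM_UTF8` is the standard-library constant for the marker.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop the 3-byte UTF-8 BOM if the data starts with it."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# One, two and three bytes per character, depending on the character.
for ch in ('"', "\u00e8", "\u201c"):
    print(repr(ch), len(ch.encode("utf-8")), "byte(s)")
```
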
As far as I know FTP cannot handle UTF-8 encoding; or rather, it does
not handle it "well".
I have successfully transferred UTF-8 files by working this way:
1. strip the header from them (you could also do it afterwards as
post-processing)
2. create a table with a single UTF-8 column (CCSID 1208)
3. transfer with FTP in BIN mode
You will find the table populated with what was in the original file.
There remains the small problem of reading/importing it into the database.
If the target table is not UTF-8, you may not find a correspondence
between the source data and the target table.
To read the records, which are no longer records since the transfer
was done in BIN, you have to read the buffer until you hit the
CR or CR/LF (it depends) and convert, with the appropriate API (iconv),
from UTF-8 into a CCSID the application can digest (13488 for UCS-2 or
1200 for UTF-16, perhaps the best solution; 1144 if traditional, at the
risk of losing information)
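
The read-and-convert step could look roughly like this. It is a Python sketch, not the iconv API; the function name `records_from_utf8` is hypothetical, and Python's cp1140 codec is used as a stand-in for the target EBCDIC CCSID because Python ships no codec for 1144. Records containing characters with no EBCDIC equivalent are reported rather than silently degraded.

```python
import codecs

def records_from_utf8(buffer: bytes, target: str = "cp1140"):
    """Split a binary-transferred UTF-8 buffer into records and convert each.

    Returns (converted_records, lossy_record_numbers).
    """
    if buffer.startswith(codecs.BOM_UTF8):       # drop the 3-byte header
        buffer = buffer[len(codecs.BOM_UTF8):]
    converted, lossy = [], []
    # splitlines() on bytes breaks on LF, CR and CR/LF.
    for number, line in enumerate(buffer.splitlines(), start=1):
        text = line.decode("utf-8")
        try:
            converted.append(text.encode(target))
        except UnicodeEncodeError:
            # Python substitutes "?" here, not the EBCDIC SUB x'3F'.
            converted.append(text.encode(target, errors="replace"))
            lossy.append(number)
    return converted, lossy
```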

If your target table is 1144 or 280, I would not spend time converting
UTF-8; have the file sent to you in an ANSI encoding and be done with it :)


Regards
CRPence
2012-03-28 20:59:13 UTC
Post by alyc
FTP won't accept 1208;
it only wants single-byte CCSIDs!!!
How should I go about importing a UTF-8 file into the IFS?
The message from alyc dated 28-Mar-2012, 08:40, was not available on the
NewsServer that I use... so I am replying to this message.

I am not sure why the CCSID parameter of FTP limits support to
single-byte CCSID values. Regardless, what is done instead to have the
FTP create the target file with the correct CCSID is to use the
variation on the TYPE Subcommand supported by the IBM i FTP server and
client.

If the FTP is from the ASCII system as client to the IBM i as FTP
server, then issue the FTP subcommand request:
quote type c 1208

If the FTP is from IBM i as FTP client to the ASCII system as FTP
server, then issue the FTP subcommand request:
type c 1208

Regardless, I do not think the issue is with the initial import into
the IFS when defaulting to "type c 819". I think in that scenario the
FTP properly negotiates the ASCII transfer and, aside from possibly
recognizing the control characters as delimiters for end-of-record, I
suspect all other data is unchanged; i.e. aside from record delimiters,
I suspect the record data is treated as "binary stream data".

I just tested with a file saved using UTF8 on a Mac, containing a
variety of characters that appear to be variations on the Quotation Mark
[and similarly for the Apostrophe]; these were choices the editor had to
offer, and it showed which UTF8 hex code points would be saved. The
"normal" quotation mark, 0x22 [GCGID: SP040000 "Quotation Marks"], and
the 'normal' apostrophe, 0x27 [GCGID: SP050000 "Apostrophe"], both
single-byte characters, transport as expected. The other characters
were all multi-byte characters, 3 bytes each in UTF-8, and those
transport as expected too.

I transported the file in binary, TYPE Image, into two separate
files: utf8.bin and utf8.txt

I then tagged the utf8.txt file with CCSID(1208) using the request to
CHGATR ATR(*CCSID) VALUE(1208). The utf8.bin file was unchanged, having
been left with the CCSID(819) as created by the FTP.

When I viewed the file utf8.bin using DSPF F10=Hex, all three of the
multi-byte code points were shown in the hex, and each byte was
presented as the "#" character; the "replace unprintable character" for
the Display File (DSPF) command\feature similar to the RPLUNPRT()
feature of printer files.

When I viewed the file utf8.txt using DSPF F10=Hex, all three of the
multi-byte code points were shown in the hex, but all three were
presented as just the *one* "#" character; again, the "replace
unprintable character" for the Display File (DSPF) command\feature
similar to the RPLUNPRT() feature of printer files.

Having copied either of those stream files into a database file with
CVTDTA(*AUTO), telling the CPYFRMSTMF utility that the text stream data
was CCSID(1208), caused the multibyte characters to be replaced with the
EBCDIC "substitution character", hex code point 0x3F. Certainly not
much of a "best-fit" conversion\translation :-( where one might hope the
characters might instead translate to the /most similar/ EBCDIC
character 0x7F or 0x7D. The scripted requests follow:

≥ crtsrcpf qtemp/dta rcdlen(412) ccsid(1141)

≥ cpyfrmstmf utf8.txt '/qsys.lib/qtemp.lib/dta.file/utf8.mbr'
cvtdta(*auto) mbropt(*replace) stmfcodpag(1208) dbfccsid(1141)
endlinfmt(*crlf) /* or: dbfccsid(*file) stmfcodpag(*stmf) */

≥ dsppfm qtemp/dta mbr(utf8) /* use F10=Hex to see the x'3F' */

I am not sure why the Copy From Stream File utility does not issue a
warning message to indicate that "substitution characters" were used to
effect the conversion. If the database had done the conversion, then I
believe such a warning is typically provided to the caller, and
generally [as I recall] a message is logged.

I suppose CPYFRMSTMF could be replaced with a custom utility, so that
perhaps a better translation could be used [e.g. choosing a /best fit/
algorithm from 1208 to 1144 with iconv might be an option, and
effective?], or so that, even if the effect is the same /substitution
character/, a warning is at least logged for the data loss on those
multi-byte characters.
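
As a rough illustration of that idea, here is a Python sketch of a best-fit step placed in front of the conversion. Everything in it is an assumption for illustration: the mapping table `BEST_FIT`, the function name `to_ebcdic_best_fit`, and the use of Python's cp1140 codec in place of the real 1141/1144 conversion. The typographic quotes are first folded to the plain quotation mark and apostrophe, and anything that still cannot be converted becomes x'3F' with a logged warning.

```python
import logging

# Typographic quotes folded to their nearest plain equivalents.
BEST_FIT = str.maketrans({
    "\u201c": '"', "\u201d": '"', "\u201e": '"', "\u201f": '"',
    "\u2018": "'", "\u2019": "'", "\u201a": "'", "\u201b": "'",
})

def to_ebcdic_best_fit(utf8_line: bytes, target: str = "cp1140") -> bytes:
    text = utf8_line.decode("utf-8").translate(BEST_FIT)
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(target)
        except UnicodeEncodeError:
            logging.warning("no %s mapping for U+%04X, using x'3F'",
                            target, ord(ch))
            out.append(0x3F)          # EBCDIC substitution character
    return bytes(out)

print(to_ebcdic_best_fit("Subject: \u201cempty\u201d".encode("utf-8")).hex())
```

With this folding the quotes end up as 0x7F and 0x7D, the /most similar/ EBCDIC characters mentioned earlier, instead of 0x3F.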

I mention a custom replacement, because I presume that the Conversion
table (TBL) specification available from the CPYFRMSTMF utility does not
provide any support for multi-byte characters; I could easily be
wrong... I did not look further to see if there is some support.

What was odd for me, or at least something I do not recall as the
effect, is that my emulator [default\built-in telnet on the Mac] seems
to simply "lose" the x'3F' character in presentation; i.e. there is no
glyph, and the appearance of the data on the screen is shifted as if the
character was not even in the 5250 data stream. Thus if I issue a
request to RUNQRY *N ((QTEMP/DTA UTF8)) RCDSLT(*YES), my line of data
appears as the non-delimited 'Subject: empty string', even though the
DSPPFM F10=Hex shows the 0x3F "E2A48291 8583A37A 403F8594 97A3A83F"; and
of course, a search on SRCDTA LIKE '% empty%' will exclude that row, and
a search on SRCDTA LIKE '%_empty%' finds the row. On my PComm 5250
emulator I seem to recall the x'3F' appeared as a visible glyph.
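
The hex string from that display can be decoded off the system to see why the two LIKE patterns behave differently. A Python sketch, assuming cp037 as a stand-in for the job's EBCDIC code page (the letters, colon and blank in this dump sit on the same code points in the common Latin-1 EBCDIC pages) and using `fnmatch` wildcards to imitate LIKE, with `*` for `%` and `?` for `_`.

```python
from fnmatch import fnmatchcase

dump = "E2A48291 8583A37A 403F8594 97A3A83F"
row = bytes.fromhex(dump.replace(" ", "")).decode("cp037")

print(repr(row))   # the x'3F' bytes decode to U+001A (SUBSTITUTE)

# LIKE '% empty%': a blank must sit directly before "empty".
blank_then_empty = fnmatchcase(row, "* empty*")
# LIKE '%_empty%': any single character may sit before "empty".
any_then_empty = fnmatchcase(row, "*?empty*")
print(blank_then_empty, any_then_empty)
```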

For reference, some links which describe the characters which were
tested and problematic in my scenario. Very likely these variations of
the "quotation mark" character are the same issue you are experiencing.

http://www.tachyonsoft.com/uc0020.htm
http://unicode-search.net/unicode-namesearch.pl?term=QUOTATION

unicode  utf8      name
201C     E2 80 9C  LEFT DOUBLE QUOTATION MARK
201D     E2 80 9D  RIGHT DOUBLE QUOTATION MARK
201F     E2 80 9F  DOUBLE HIGH-REVERSED-9 QU...
201E     E2 80 9E  DOUBLE LOW-9 QUOTATION MARK
201B     E2 80 9B  SINGLE HIGH-REVERSED-9 QU...
201A     E2 80 9A  SINGLE LOW-9 QUOTATION MARK
2018     E2 80 98  LEFT SINGLE QUOTATION MARK
2019     E2 80 99  RIGHT SINGLE QUOTATION MARK
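
The byte sequences listed above can be checked with a few lines of Python, which derives the UTF-8 bytes and the character name from each code point using only the standard library.

```python
import unicodedata

code_points = [0x201C, 0x201D, 0x201F, 0x201E,
               0x201B, 0x201A, 0x2018, 0x2019]

for cp in code_points:
    utf8 = chr(cp).encode("utf-8")
    print(f"U+{cp:04X}  {utf8.hex().upper()}  {unicodedata.name(chr(cp))}")
```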

Regards, Chuck
