regex – sed – Remove quotes within quotes in large CSV files
I am using the stream editor sed to convert a large set of text file data (400MB) to CSV format. I am very close to finishing, but the outstanding problem is quotes within quotes, for data like this:

    1,word1,"description for word1","another text",""text contains "double quotes" some more text"
    2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
    3,word3,"description for "word3"","more text and more"

The desired output is:

    1,"text contains double quotes some more text"
    2,"description for word3","more text and more"

I have been looking for help but have not come close to a solution. I tried the following sed commands with regex patterns:

    sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
    sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt

These come from the following questions, but they do not seem to work with sed:

Related question for perl
Related question for SISS

The original files are *.txt, and I am trying to edit them with sed.

Solution

Here is one way using GNU awk and the FPAT variable:

    gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^".*"$/) { gsub(/"/,"",$i); $i=N $i N } }1' file

Result:

    1,"text contains double quotes some more text"
    2,"another text","more text and more"
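For readers who want to stay with sed: its POSIX regular expressions have no lookbehind or lookahead, which is why the patterns tried in the question fail. The sketch below is not part of the original answer. It emulates the second pattern with two capture groups and a loop, deleting any double quote that has a non-comma character on both sides. The function name `strip_inner_quotes` is a hypothetical helper, and the approach assumes that a legitimate field-delimiting quote always sits next to a comma or at the start or end of a line. Under that assumption, an embedded quote that happens to touch a comma is left alone, and properly escaped `""` pairs inside a field are stripped as well.

```shell
#!/bin/sh
# strip_inner_quotes: hypothetical helper, a sketch rather than a full CSV parser.
# Deletes every double quote that has a non-comma character on both sides.
# Quotes next to a comma, or at the start or end of a line, are treated as
# field delimiters and kept.
strip_inner_quotes() {
    # :a / ta repeat the substitution until nothing changes, because one
    # pass with the g flag can miss a quote whose neighbour was consumed
    # by the previous match (for example in a run of three quotes).
    sed -e ':a' -e 's/\([^,]\)"\([^,]\)/\1\2/g' -e 'ta' "$@"
}

# The three sample rows from the question, fed on standard input.
strip_inner_quotes <<'EOF'
1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
3,word3,"description for "word3"","more text and more"
EOF
# Prints:
# 1,word1,"description for word1","another text","text contains double quotes some more text"
# 2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
# 3,word3,"description for word3","more text and more"
```

The helper also accepts file names, for example `strip_inner_quotes data.txt > data.csv` (file names here are placeholders). With GNU sed, adding `-i` to the same expressions would edit the files in place as in the question's commands, so it is worth trying the command on a copy first.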