regex – sed – Removing quotes inside quotes in a large CSV file

Published: 2020-12-24 22:30:36 | Category: Linux | Source: compiled from the web


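I am using the stream editor sed to convert a large set of text file data (400MB) into CSV format.

I am very close to finishing, but the outstanding problem is quotes inside quotes, in data like this: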

1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
3,word3,"description for "word3"","more text and more"

The desired output is:

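1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
3,word3,"description for word3","more text and more"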

I have been looking for help but have not gotten close to a solution. I tried the following sed commands with these regex patterns:

sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt

These were taken from other questions, but they do not seem to work with sed.
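The patterns above rely on lookahead and lookbehind, which sed's POSIX regular expressions (basic or extended) do not support. The same intent can be approximated without lookaround by capturing the neighbouring characters and looping until nothing changes. This is only a sketch: `strip_inner_quotes` is a hypothetical helper name, and it assumes every field is quoted at its edges so that a quote touching a comma or a line boundary is always a real delimiter.

```shell
# strip_inner_quotes is a hypothetical helper name, not from the original post.
# Deletes every double quote that has a non-comma character on both sides;
# quotes at the start or end of a line, or next to a comma, are kept.
# The :a ... ta loop repeats the substitution because a single pass with /g
# can miss a quote whose left neighbour was consumed by the previous match.
strip_inner_quotes() {
  sed -E -e ':a' -e 's/([^,])"([^,])/\1\2/g' -e 'ta' "$@"
}

# Example on the third sample line:
printf '%s\n' '3,word3,"description for "word3"","more text and more"' | strip_inner_quotes
# prints: 3,word3,"description for word3","more text and more"
```

An inner quote that sits directly next to a comma is ambiguous under this rule and is left in place.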

Solution

Here is one way using GNU awk and the FPAT variable:

gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^".*"$/) { gsub(/"/,"",$i); $i=N $i N } }1' file

Result:

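1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
3,word3,"description for word3","more text and more"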

Explanation:

Using FPAT, a field is defined as either “anything that is not a comma” or “a double quote, anything that is not a double quote, and a closing double quote”. Then, on every line of input, loop through each field, and if the field starts and ends with a double quote, remove all quotes from the field. Finally, add double quotes surrounding the field. The trailing 1 is an always-true pattern that prints the rebuilt record.
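To see how FPAT splits a record, it can help to print each field on its own line. The sketch below uses the same FPAT value as the answer; `show_fields` is a hypothetical helper name, and FPAT requires GNU awk 4.0 or later.

```shell
# show_fields is a hypothetical helper name, not from the original answer.
# Prints every field gawk finds on each input line, one per output line,
# using the same FPAT definition as the command above.
show_fields() {
  gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
        { for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }' "$@"
}

# Example on the second sample line, whose last field contains commas:
printf '%s\n' '2,word2,"description for word2","text may not contain double quotes,but may contain commas,"' | show_fields
# prints:
# 1: 2
# 2: word2
# 3: "description for word2"
# 4: "text may not contain double quotes,but may contain commas,"
```

The fourth field stays in one piece because, where both alternatives match at the same position, gawk takes the longest match, which here is the quoted one.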

