Maintain structure of content in XML?

Johnny Dubs

Hi there,


So I'm creating an application to parse particular documents (in this case e-mails) to XML. I have no problems parsing the computer-generated headers in the e-mails (message ID, to, from, etc.), but when it comes to the content of the e-mail I have problems.

I open the file with a StreamReader and then read it line by line. When it comes to reading the content of the e-mail I just use

VB.NET:
sInputLine = FileReader.ReadLine()
temp &= sInputLine & vbCrLf

in a loop, and I've tested outputting this to a text file, which reproduces the information with the same structure as the original e-mail.

But when I try to put this into the XML using:


VB.NET:
Dim xml_content As XmlElement
xml_content = Doc.CreateElement("field")
xml_content.InnerText = "" & content & ""

(where content has been assigned the value of temp) it reproduces all of the information as one big group of words.

Since I need to analyse this content once it's in XML format, it's important that I maintain the structure. Does anyone know how to?


Cheers!
 
it reproduces all of the information as one big group of words.
I don't know what you mean by this; XML does "maintain structure of content" as far as I know. You should still store the raw email content as a Base64 string to avoid interference with possible markup. Example of converting to and from:
VB.NET:
Dim content As String = "test"
' UTF8 rather than ASCII, so non-ASCII characters in the e-mail are not replaced by "?"
Dim contentBytes() As Byte = System.Text.Encoding.UTF8.GetBytes(content)
Dim contentB64 As String = Convert.ToBase64String(contentBytes)
' ...and back again
contentBytes = Convert.FromBase64String(contentB64)
content = System.Text.Encoding.UTF8.GetString(contentBytes)
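
For reference, a minimal sketch of how that Base64 string could be stored in the "field" element from the first post and read back again. The module name and sample text are made up for illustration, and UTF-8 is an assumption chosen so that non-ASCII characters survive the trip.

```vbnet
Imports System
Imports System.Text
Imports System.Xml
Imports Microsoft.VisualBasic

Module Base64FieldSketch
    Sub Main()
        ' Hypothetical sample body; a real one would come from the parsed e-mail.
        Dim content As String = "Line one" & vbCrLf & "<b>Line two</b>"

        Dim doc As New XmlDocument()
        Dim field As XmlElement = doc.CreateElement("field")
        doc.AppendChild(field)

        ' Encode: text -> bytes -> Base64, then store it as the element text.
        field.InnerText = Convert.ToBase64String(Encoding.UTF8.GetBytes(content))

        ' Decode: element text -> bytes -> original text.
        Dim restored As String = Encoding.UTF8.GetString(Convert.FromBase64String(field.InnerText))

        ' Compare ordinally so the line break and the markup are checked character by character.
        Console.WriteLine(String.Equals(content, restored, StringComparison.Ordinal))
    End Sub
End Module
```

The trade-off is that the stored text is no longer human-readable in the XML file, so anything that analyses the content has to decode it first.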
 
When I say it doesn't maintain the structure I mean it comes out like this:

1. SPECIAL ANNOUNCEMENT: Treat yourself to Multex Investor's NEW Personal Finance Channel to take advantage of top-notch content and tools FREE.2. DAILY FREE SPONSOR REPORT: Robertson Stephens maintains a "buy" rating on Divine Interventures (DVIN).3. FREE RESEARCH REPORT: Jefferies & Co. rates America Online (AOL) a "buy," saying projected growth remains in place.

rather than like this:

1. SPECIAL ANNOUNCEMENT: Treat yourself to Multex Investor's NEW Personal
Finance Channel to take advantage of top-notch content and tools FREE.

2. DAILY FREE SPONSOR REPORT: Robertson Stephens maintains a "buy" rating
on Divine Interventures (DVIN).

3. FREE RESEARCH REPORT: Jefferies & Co. rates America Online (AOL) a
"buy," saying projected growth remains in place.


The vbCrLf in "temp &= sInputLine & vbCrLf" above is there to try and input a newline after each line, but it doesn't seem to make a difference when it's converted to XML (unlike when it's output to a normal text file, and it appears as I want it to).
 
When it's converted to XML using the code above it just doesn't include any of the spaces I try and enforce. I don't know if there's a special XML code for a line break that I need to include because the transition of vbCrLf from VB.NET to XML doesn't work?
 
I really don't see what you mean. The code you posted preserves all whitespace. You can verify this with the code below, which stores text containing a CR+LF, reads it back, and checks for the presence and index of the line break; the ix variable will be 1:
VB.NET:
xml_content.InnerText = "a" & vbNewLine & "b"
Dim ix As Integer = xml_content.InnerText.IndexOf(vbNewLine)
You can also write it to a file and load it back and it still hasn't lost anything.
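
A self-contained sketch of that file round trip, for anyone who wants to try it. The module name, the temp file name, and the sample text are all made up for illustration. It splits on either line-ending style, on the assumption that a parser may hand back LF where CR+LF was written.

```vbnet
Imports System
Imports System.IO
Imports System.Xml
Imports Microsoft.VisualBasic

Module RoundTripSketch
    Sub Main()
        ' Hypothetical multi-line body with one blank line in the middle.
        Dim body As String = "1. First item" & vbCrLf & "continued" & vbCrLf & vbCrLf & "2. Second item"

        Dim doc As New XmlDocument()
        Dim field As XmlElement = doc.CreateElement("field")
        field.InnerText = body
        doc.AppendChild(field)

        ' Write to a temp file (hypothetical name) and load it into a fresh document.
        Dim filePath As String = System.IO.Path.Combine(System.IO.Path.GetTempPath(), "roundtrip_sketch.xml")
        doc.Save(filePath)

        Dim loaded As New XmlDocument()
        loaded.Load(filePath)
        Dim loadedText As String = loaded.DocumentElement.InnerText
        File.Delete(filePath)

        ' Treat CR+LF and LF alike, then list each line in brackets so blank lines are visible.
        Dim lines() As String = loadedText.Replace(vbCrLf, vbLf).Split(ControlChars.Lf)
        Console.WriteLine("Line count: " & lines.Length)
        For Each textLine As String In lines
            Console.WriteLine("[" & textLine & "]")
        Next
    End Sub
End Module
```

If the line breaks survive, the bracketed lines match the original layout, blank line included, whatever a particular viewer shows on screen.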
 
Care to explain yourself? I'm starting to think you're just bullshtting now.
 
I've got a feeling this is an editor-dependent issue!

Some editors show the line breaks; others (e.g. IE!!!) don't. Others don't either, but show the [] line break character where there should be a new line.
 
I see, you are displaying an XML file in Internet Explorer; why didn't you just say that? IE doesn't display XML as-is. It displays it "conveniently", with node tree expansion and similar features. View the source of the page and you'll see the actual, correct source file. You can't control how IE will display XML: it favours active navigation of feeds and similar content and their XML structure, and the whitespace of a text node is of no concern to that application.
 
If you want to display XML content in IE, you either have to put up with how it interprets things, or attach an XSL stylesheet that transforms the content to HTML. Line breaks then have to be output HTML-style, as <br/> elements.
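
A rough sketch of the "attach a stylesheet" part, assuming the document is built with XmlDocument as in the first post. The stylesheet name "email.xsl" is hypothetical, and the stylesheet itself (which would have to emit the <br/> elements) is not shown here.

```vbnet
Imports System
Imports System.Xml
Imports Microsoft.VisualBasic

Module StylesheetLinkSketch
    Sub Main()
        Dim doc As New XmlDocument()
        doc.AppendChild(doc.CreateXmlDeclaration("1.0", "utf-8", Nothing))

        ' "email.xsl" is a hypothetical stylesheet name, not something from this thread.
        Dim styleLink As XmlProcessingInstruction = doc.CreateProcessingInstruction("xml-stylesheet", "type=""text/xsl"" href=""email.xsl""")
        doc.AppendChild(styleLink)

        ' Root element holding multi-line text, as in the original question.
        Dim field As XmlElement = doc.CreateElement("field")
        field.InnerText = "first line" & vbCrLf & "second line"
        doc.AppendChild(field)

        ' Show the resulting markup, including the xml-stylesheet instruction.
        Console.WriteLine(doc.OuterXml)
    End Sub
End Module
```

The processing instruction only points the browser at the stylesheet; the stored text, line breaks included, stays unchanged in the file.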
 
Ha ha ha. This reminds me of when Experian (the credit company advertising on this very forum :) ) sent me a bunch of example XML data to code up integration with their services, and it had - signs all through it. I was thinking "WTF is this?" and then I realised that one of their nuggets had opened an XML file in IE, copied what they saw, pasted it into a Word document and attached the doc to an email. The - signs were IE's rendering of the node collapser buttons.


That (and their insistence that a Word document spec was better than an XSD), for me, kinda summed up the level of technical expertise I was having to work with. :rolleyes:
 