Email parser preserves leading whitespace at beginning of wrapped header value #109252

Open
fsc-eriker opened this issue Sep 11, 2023 · 1 comment
Labels
stdlib Python modules in the Lib dir topic-email type-bug An unexpected behavior, bug, or error

Comments


fsc-eriker commented Sep 11, 2023

Bug report

Bug description:

The email library preserves the folding whitespace, including the CRLF, at the start of a header value when the header is folded directly after the field name.

from email import message_from_string
from email.policy import default

message = message_from_string("Message-id:\r\n\t<abcdef@example.com>\r\n\r\n")
assert message['message-id'] == '<abcdef@example.com>', \
    f"Expected '<abcdef@example.com>' but got {message['message-id']!r}"

The failure message says

Traceback (most recent call last):
  File "/Users/myself/work/email-cpython-fork/repronnnn.py", line 5, in <module>
    assert message['message-id'] == '<abcdef@example.com>', \
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Expected '<abcdef@example.com>' but got '\r\n\t<abcdef@example.com>'

To quote RFC 5322 section 3.2.2, 'any CRLF that appears in FWS is semantically "invisible"' (where "FWS" means "folding whitespace"). Later on, the section concludes, 'Runs of FWS, comment, or CFWS that occur between lexical tokens in a structured header field are semantically interpreted as a single space character' (where "CFWS" means runs of parenthesized comments and/or FWS).

(The Message-Id header is a structured field. Elsewhere in the RFC, the Subject: header, which is unstructured, is used in an example to illustrate this mechanism, though I'd still have to verify whether there is a gap in the RFC when it comes to specifying the semantics of non-CRLF FWS in unstructured header values.)
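For illustration, here is a minimal sketch of the unfolding the RFC describes, applied to the raw value from the traceback above. The helper name `unfold_header_value` is hypothetical and not part of the email library, and stripping the surrounding whitespace is an assumption that fits a structured field such as Message-Id; it may not be the right choice for unstructured fields.

```python
import re


def unfold_header_value(raw_value: str) -> str:
    """Hypothetical helper: unfold a raw header value.

    Removes any line break that is immediately followed by a space or tab
    (the CRLF inside folding whitespace), then strips leading and trailing
    whitespace. The stripping step is an assumption for structured fields.
    """
    unfolded = re.sub(r"\r?\n(?=[ \t])", "", raw_value)
    return unfolded.strip()


# The raw value reported in the AssertionError above.
raw = "\r\n\t<abcdef@example.com>"
print(repr(unfold_header_value(raw)))
```

With this kind of normalization, the value compares equal to '<abcdef@example.com>', which is what the assertion in the report expects.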

CPython versions tested on:

3.9, 3.11, CPython main branch

Operating systems tested on:

macOS


nmeum commented Mar 5, 2024

Hi, I believe this issue also causes problems for software that does DKIM (RFC 6376) signature validation with the Python email library. The "simple" Header Canonicalization Algorithm specified in RFC 6376 is sensitive to whitespace. Thus, if the Python email library is used to extract headers from an email in order to verify the DKIM signature, the verification will fail because of the whitespace added by the library.

As an example, consider:

>>> import email
>>> msg = email.message_from_string("Foo:\n bar")
>>> msg.as_string()
'Foo: \n bar\n\n'

In this example, a space was added after the header name Foo:.
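To show why that one extra space matters, here is a rough sketch of the two header canonicalizations, simplified from RFC 6376 sections 3.4.1 and 3.4.2. The function names `canon_simple` and `canon_relaxed` are hypothetical, the inputs use CRLF line endings as an assumption about the wire format, and this is not a DKIM verifier: it only compares the canonical forms that would be fed into the hash.

```python
import re


def canon_simple(header: str) -> str:
    """'simple' header canonicalization: the field is left unchanged."""
    return header


def canon_relaxed(header: str) -> str:
    """Simplified sketch of 'relaxed' header canonicalization."""
    name, _, value = header.partition(":")
    # Unfold: drop line breaks that are followed by a space or tab.
    value = re.sub(r"\r?\n(?=[ \t])", "", value)
    # Collapse each run of spaces and tabs into a single space.
    value = re.sub(r"[ \t]+", " ", value)
    # Lowercase the name and drop whitespace around the colon and at the end.
    return f"{name.strip().lower()}:{value.strip()}\r\n"


original = "Foo:\r\n bar\r\n"       # header as received
reserialized = "Foo: \r\n bar\r\n"  # header with the extra space after "Foo:"

print(canon_simple(original) == canon_simple(reserialized))
print(canon_relaxed(original) == canon_relaxed(reserialized))
```

Under "simple" canonicalization the two forms differ, so a hash computed over the reserialized header no longer matches the signature. Under "relaxed" canonicalization both reduce to the same string, which is why only signatures using "simple" are affected.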

This was discovered while debugging a DKIM signature validation failure in SourceHut: https://lists.sr.ht/~sircmpwn/sr.ht-discuss/%3C3D5N7D1SV2P7R.2WTGBFT8VHKD7%408pit.net%3E
