Prologue
The following is an attempt at highlighting the sources of high astonishment arising from the implementation of the std::ops::Deref trait on custom types. I write this, in part, because I recently used Deref inappropriately while at $day_job, due to my lack of a full understanding of how Deref interacts with the rest of the language. It should go without saying that this was a confusing experience.
Naturally, the aforementioned mishap will serve as a motivating example throughout this dive into Rust semantics.
Newtypes
I cannot share the full context surrounding my misuse of Deref (for both pedagogic and legal reasons), so please understand that my motivating example won't be well motivated itself.
With that disclaimer out of the way, suppose we're building a library that provides an ASCII string type called AsciiString. A fitting representation of these so-called byte strings would simply be a plain ol' Vec<u8>.
However, ASCII characters are limited to the range 0..128, and thus not every value of type Vec<u8> is a valid ASCII string. In other words, it's not a good idea to alias AsciiString to Vec<u8>; they are not interchangeable.
In Rust [1], it is considered good practice to enforce invariants at the type level. Thus, the idiomatic course of action would be to use a newtype to wrap a Vec<u8> and then statically guarantee that all wrapped byte vectors represent valid ASCII strings.
use std::ops::Range;

/// An ASCII-encoded, growable string.
struct AsciiString(Vec<u8>);

/// Error returned when a byte falls outside the ASCII range.
#[derive(Debug)]
struct AsciiError;

impl AsciiString {
    /// Range of valid ASCII bytes.
    const ASCII_RANGE: Range<u8> = 0..128;

    /// Converts a byte vector to an `AsciiString`.
    ///
    /// # Errors
    /// Returns `Err` if any byte in `vec`
    /// falls outside the range [`Self::ASCII_RANGE`].
    fn from_ascii(vec: Vec<u8>) -> Result<Self, AsciiError> {
        if vec.iter().all(|b| Self::ASCII_RANGE.contains(b)) {
            Ok(Self(vec))
        } else {
            Err(AsciiError)
        }
    }
}
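For instance, a quick sanity check of the invariant (a minimal sketch; the byte values are chosen arbitrarily for illustration):

let ok = AsciiString::from_ascii(b"ABC".to_vec());
assert!(ok.is_ok());

let not_ascii = AsciiString::from_ascii(vec![0xFF]);
assert!(not_ascii.is_err());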
Explicit conversion
UTF-8 is backward compatible with ASCII. This means that we can freely reinterpret any well-constructed AsciiString value as a str value. What we stand to gain from this is compatibility with code written with UTF-8 strings in mind.
use std::str;

impl AsciiString {
    fn as_str(&self) -> &str {
        debug_assert!(str::from_utf8(&self.0).is_ok());
        // SAFETY: self.0 is ASCII-encoded,
        // and ASCII is a subset of UTF-8.
        unsafe { str::from_utf8_unchecked(&self.0) }
    }
}
Two statements in, and the unspeakable keyword has already been uttered; I'm feeling pretty tall today and the temptation is frankly irresistible. Obviously, any bug surrounding the construction of AsciiString values implies memory unsafety outside debug mode (I leave it to the reader to decide whether that is a risk worth taking).
Hold on a second. Both std::str::from_utf8 and std::str::from_utf8_unchecked take a &[u8] as an argument, while &self.0 is of type &Vec<u8>; there is definitely some kind of implicit coercion going on here. Is Rust not so strongly typed after all?
Enter Deref
pub trait Deref {
    type Target: ?Sized;

    fn deref(&self) -> &Self::Target;
}
We seem to have inadvertently run into Deref before I even got the chance to introduce it properly. The simple explanation of why &Vec<u8> was "coerced" to &[u8] is that Vec<T> implements Deref<Target = [T]>. In short, the compiler conceptually transformed the expression str::from_utf8(&self.0) into something akin to:
str::from_utf8(<Vec<u8> as Deref>::deref(&self.0))
The mechanism at play here is called Deref Coercion, and it arises from the interaction between two distinct mechanisms described in The Rust Reference: Method Resolution and Type Coercion. Having already dropped so much jargon, I refrain from uncovering any gory details for now.
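To see the coercion in isolation, here is a minimal, self-contained sketch (the takes_slice function is made up for the example):

fn takes_slice(bytes: &[u8]) -> usize {
    bytes.len()
}

let vec: Vec<u8> = vec![1, 2, 3];
// Deref coercion: &Vec<u8> is coerced to &[u8]
// because Vec<T> implements Deref<Target = [T]>.
assert_eq!(takes_slice(&vec), 3);
// The explicit, desugared form of the same call:
assert_eq!(takes_slice(<Vec<u8> as std::ops::Deref>::deref(&vec)), 3);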
Clean Code™
For all practical purposes, AsciiString::as_str works perfectly as a sure and simple way to convert any &AsciiString reference to a &str. However, it means one has to manually add .as_str() every time a conversion is needed.
For dubious reasons involving "clean code", we shall implement Deref<Target = str> for AsciiString: the compiler will from now on call .as_str() on our behalf.
use std::ops::Deref;

impl Deref for AsciiString {
    type Target = str;

    fn deref(&self) -> &Self::Target {
        self.as_str()
    }
}
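With that in place, a &AsciiString can be passed wherever a &str is expected, and str's methods resolve directly on AsciiString. A small illustrative sketch (the greet function is made up for the example):

fn greet(name: &str) -> String {
    format!("Hello, {name}!")
}

let ascii = AsciiString::from_ascii(b"Ferris".to_vec()).unwrap();
// Deref coercion: &AsciiString -> &str.
assert_eq!(greet(&ascii), "Hello, Ferris!");
// Method resolution through Deref: `len` comes from `str`.
assert_eq!(ascii.len(), 6);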
Idiomatic Code™
For those who care not for aesthetics when it comes to computer programs — a vulgar but arguably rational position — I shall appeal to Rust idioms [2]. Chapter 5 of the Rust API Guidelines contains a recommendation titled C-DEREF, which states:
The Deref traits are used implicitly by the compiler in many circumstances, and interact with method resolution. The relevant rules are designed specifically to accommodate smart pointers, and so the traits should be used only for that purpose.
Is AsciiString a smart pointer? What is a smart pointer anyway? That is a debate I'm uninterested in exploring here. However, the authors of the Rust API Guidelines go on to give examples of smart pointers in the standard library:
Box<T>
String as a smart pointer to str
Rc<T>
Arc<T>
Cow<'a, T>
On these grounds, I claim that AsciiString ought to be considered a smart pointer to str. Note that the idea that String is a smart pointer is itself often contested, since String is not generic over the data it points to the way Box<T> and friends are.
To my astonishment
Note that AsciiString inherits the implementation of Index<Range<usize>> from str. We get this as a side effect of implementing Deref, and it allows us to borrow string slices from an AsciiString [3].
let data = vec![b'A', b'B', b'C'];
let ascii = AsciiString::from_ascii(data).unwrap();
assert_eq!(&ascii[1..3], "BC");
Up until now, we have been relatively well-behaved rustaceans. But where is the fun in that? One cannot be writing production-ready code all the time, for that is a sure path to madness.
Let's allow AsciiString to be indexed using floating-point numbers, precisely because we can.
use std::ops::Index;

impl Index<Range<f64>> for AsciiString {
    type Output = str;

    /// Returns an ASCII string slice by rounding
    /// the given floating-point indices in `index`.
    fn index(&self, index: Range<f64>) -> &Self::Output {
        /// Returns the integer nearest
        /// to the absolute value of `float`.
        fn nearest_index(float: f64) -> usize {
            float.abs().round() as usize
        }

        let start = nearest_index(index.start);
        let end = nearest_index(index.end);
        let string = self.as_str();
        &string[start..end]
    }
}
You may be wondering what happens when your index is exactly halfway between two integers. The answer is that f64::round will round away from 0.0, thus 0.5 will be rounded to 1. At this point, we might as well have had Index toss a coin to resolve this ambiguity.
let data = vec![b'A', b'B', b'C'];
let ascii = AsciiString::from_ascii(data).unwrap();
assert_eq!(&ascii[0.5_f64..3_f64], "BC");
We finally arrived at the part which surprised me and served as the motivation for all that I wrote before and after this point: we no longer inherit Index<Range<usize>> from str. The previous version of the above code block doesn't compile anymore:
error[E0308]: mismatched types
  --> src/main.rs:69:23
   |
69 |     assert_eq!(&ascii[1..3], "BC");
   |                       ^^^^ expected `Range<f64>`, found `Range<{integer}>`
   |
   = note: expected struct `std::ops::Range<f64>`
              found struct `std::ops::Range<{integer}>`
To me, this was incredibly confusing, because Index is generic over its Idx type, which means it's perfectly valid to have multiple implementations of Index<Idx> for different Idx types [4].
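For instance, the following sketch, with a made-up Pair type, compiles without complaint because both implementations live on the same type:

use std::ops::Index;

struct Pair(u8, u8);

impl Index<bool> for Pair {
    type Output = u8;

    fn index(&self, idx: bool) -> &u8 {
        if idx { &self.1 } else { &self.0 }
    }
}

impl Index<usize> for Pair {
    type Output = u8;

    fn index(&self, idx: usize) -> &u8 {
        match idx {
            0 => &self.0,
            1 => &self.1,
            _ => panic!("index out of bounds"),
        }
    }
}

let pair = Pair(10, 20);
assert_eq!(pair[true], 20);
assert_eq!(pair[0_usize], 10);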
Somehow, the compiler can no longer find the implementation of Index<Range<usize>> on str. If you're like me, it's not at all obvious how that came to be. To better understand all of this, we dig deeper into exactly how the compiler resolves methods in the presence of a Deref implementation.
Nitty-gritty details
Everything I describe below is limited to commit 6bacf5a of the Rust project, i.e. Rust 1.69.0.
Method lookup in rustc is divided into two phases: probe and confirm. As the Rust Compiler Development Guide explains, probe is concerned with finding the callee method, while confirm solves for unknown type variables.
During the probe phase, the compiler would first perform a "Deref-loop": essentially repeated dereferencing of the receiver type, i.e. AsciiString. In our particular case, this would yield two steps: AsciiString and str.
Then, the compiler would assemble a list of candidates from each step. These candidate methods are subsequently searched for one that fits the receiver type. Up until this point, it seems that rustc should've been able to find the index method from the Index<Range<usize>> implementation on str, because the search always includes Deref targets.
The issue, it turns out, is that method search does not check method parameter types against argument types. Rather, it only checks for method names and where-clauses such as T: Trait. Types are only resolved in the confirm phase, at which point rustc would have already picked its method.
But why is method lookup in rustc split into two phases? The answer lies in the cacheability of the picks provided by the probe phase. In other words, rustc developers wished to reuse probe search results of identical methods, hence why probe cannot depend on inferred types local to one particular call site. I would wager that this was most likely done to address performance issues.
If you can spot any inaccuracies in the above description, please don't shy away from correcting me via email. I would appreciate it, as I am by no means a rustc developer [5].
Epilogue
Moral of the story: don't implement Deref, kids!
On a more serious note: if one does implement Deref<Target = U> on a type T, and it proves necessary to implement a generic trait Trait<S> on T, then one should ensure that the implementations of Trait<S> for different parameters S are not scattered across T and U.
Put simply, either T or U should host all existing Trait<S> implementations. Note that the preceding "or" is inclusive, meaning that T and U can both host Trait<S> implementations, as long as each of them hosts all of them.
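Applied to our running example, the rule amounts to keeping both Index implementations on AsciiString itself. A minimal sketch of that fix, assuming the types and imports from earlier:

impl Index<Range<usize>> for AsciiString {
    type Output = str;

    /// Delegates integer-range indexing to `str`, so that it
    /// lives next to the `Range<f64>` implementation.
    fn index(&self, index: Range<usize>) -> &Self::Output {
        &self.as_str()[index]
    }
}

With this in place, both &ascii[1..3] and &ascii[0.5_f64..3_f64] resolve on AsciiString directly.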
Footnotes
[1] This is not unique to Rust, and is arguably true for all languages featuring strong type systems, e.g. Haskell.
[2] If you care for neither aesthetics nor idioms, then you are truly lost.
[3] Ideally, we would create an AsciiStr type representing a slice of an AsciiString, since using &str means losing knowledge of the ASCII invariant at the type level.
[4] Indeed, the issue is "resolved" by (re)implementing Index<Range<usize>> directly on AsciiString.
[5] Yet!