So, what context are they claiming might be changed because of this conversion? When you convert WAV to MP3, the audio goes through a similar lossy conversion process. I challenge any ordinary music listener to play the same song in WAV and in MP3 and hear any difference.
No.
Most WAV files are pulled from CDs, which have a sampling rate of 44.1 kHz (44,100 samples per second). Sometimes they come from digital sources like DAT, which is 48 kHz.
The conversion to MP3 is very configurable, but most often the sampling rate is left the same. In fact, just about the only time you see the sampling rate reduced is for spoken word (like an audiobook), where the reduction has little impact on the sound. Music dropped from 44.1 kHz to 22.05 kHz is absolutely noticeable.
But in either case, we are talking about roughly three orders of magnitude more samples per second than video has frames. If you dropped the audio to 30 samples per second it would likely be unrecognizable (probably sounding like a Speak & Spell).
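To put rough numbers on that claim, here is a quick back-of-the-envelope calculation (the 30 fps figure is just a typical video rate, not anything from a spec):

```python
# Back-of-the-envelope: audio samples per second vs. video frames per second.
audio_rate = 44_100  # CD audio sampling rate (samples/second)
video_rate = 30      # a typical digital video frame rate (frames/second)

ratio = audio_rate / video_rate
print(f"Audio carries about {ratio:.0f}x more samples per second than video has frames")
# -> about 1470x, i.e. roughly three orders of magnitude
```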
But even more to the point, audio and video are not encoded the same way.
Ignoring the analog-to-digital process, an uncompressed digital audio signal represents an actual sound wave (or several) using 16-bit numbers taken once per sample period (i.e. 44,100 times a second). Most lossy audio compression schemes (like MPEG-1 Layer 3, i.e. MP3) use models to build approximate matches to the audio wave. The more source "wave" algorithms available, and the more advanced the ways of combining them, the more processing power is needed to encode/decode and (typically) the better the approximation. But the key here is that by approximating an actual wave, rather than reproducing samples of a wave, you can play the result at any sampling frequency without affecting the original timing/tempo of the source (you effectively have infinite samples to choose from).
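For reference, here is a minimal sketch of what that uncompressed representation looks like, writing one second of a 440 Hz test tone (an arbitrary choice for illustration) as 16-bit PCM at 44.1 kHz using only the Python standard library:

```python
import math
import struct
import wave

# Uncompressed PCM audio is just a stream of 16-bit sample values
# taken at a fixed rate (44,100 per second for CD audio).
SAMPLE_RATE = 44_100
FREQ = 440.0     # A4 test tone; arbitrary choice for illustration
DURATION = 1.0   # seconds

samples = [
    int(32767 * math.sin(2 * math.pi * FREQ * n / SAMPLE_RATE))
    for n in range(int(SAMPLE_RATE * DURATION))
]

with wave.open("tone.wav", "wb") as f:
    f.setnchannels(1)                 # mono
    f.setsampwidth(2)                 # 16-bit samples
    f.setframerate(SAMPLE_RATE)
    f.writeframes(struct.pack(f"<{len(samples)}h", *samples))
```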
This is completely different from video. In video you have an image built of individual pixels (say 1920 horizontal by 1080 vertical), and each pixel can be (in 10-bit color) one of over a billion different colors. Most lossy video compression schemes first reduce the number of colors per frame by identifying groups of close colors and making them the same. Then they build a "key" frame that contains recognizable shapes of these like colors, and for every frame in between key frames (P and B frames) these shapes are tracked, recording the transformation and movement of each. This allows you to discard those in-between frames and "rebuild" them at playback time using the data from the key frames before and after, plus the shape-tracking data.
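As a toy illustration of that shape/motion-tracking idea, here is a sketch of block-based motion estimation, the simplest form of what real codecs do per macroblock (the block size, search window, and SAD metric here are illustrative choices, not any particular codec's parameters):

```python
import numpy as np

# For one 8x8 block in the current frame, search a small window in the
# previous (key) frame for the best-matching block. A real encoder then
# stores just the motion vector plus a small residual for each block.
def best_motion_vector(prev_frame, cur_frame, y, x, block=8, search=4):
    target = cur_frame[y:y + block, x:x + block].astype(np.int32)
    best, best_sad = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            py, px = y + dy, x + dx
            if (py < 0 or px < 0
                    or py + block > prev_frame.shape[0]
                    or px + block > prev_frame.shape[1]):
                continue
            candidate = prev_frame[py:py + block, px:px + block].astype(np.int32)
            sad = np.abs(candidate - target).sum()  # sum of absolute differences
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best  # offset into the previous frame that best predicts this block
```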
With video, the fewer the frames (and more importantly, the fewer the key frames) the more movement has to be approximated. Add to that the fact that video is typically played at frame rates synced to the original media source (24 fps for film, 29.97 fps for NTSC TV, and 25 fps for PAL TV being the three major ones), and converting between one and another while keeping the key frames synchronized in time forces you to drop or duplicate in-between frames. For film-to-NTSC conversion this is called 3:2 pulldown, where every 4 frames of film are turned into 5 frames of video, and a slight jittery pause is introduced because frames are repeated rather than interpolated as the average of their neighbors.
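Here is a small sketch of that 3:2 pulldown cadence (real pulldown alternates 3 and 2 interlaced fields per film frame, which is what this models):

```python
# 3:2 pulldown cadence: alternate 3 and 2 interlaced fields per film frame,
# so 4 film frames (A, B, C, D) become 10 fields = 5 video frames.
def pulldown_fields(film_frames):
    cadence = [3, 2]  # fields assigned to each successive film frame
    fields = []
    for i, frame in enumerate(film_frames):
        fields.extend([frame] * cadence[i % 2])
    return fields

print(pulldown_fields(["A", "B", "C", "D"]))
# -> ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D', 'D']  (10 fields = 5 frames)
```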
This is further complicated by older "web graphics" standards (.gif, MJPEG, etc.), which introduced lower frame rates to save space (15 fps and 12 fps specifically, as they are roughly half the broadcast and film standards) and did not bother with in-between frames at all, instead displaying the same frame until a new one is due (so if your monitor is set to a 30 Hz refresh, playing a 15 fps .GIF displays every frame twice).
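That display behavior amounts to simple frame repetition, something like this (assuming the refresh rate is an integer multiple of the source rate):

```python
# A 15 fps source on a 30 Hz display just shows each frame twice;
# no in-between frames are synthesized.
def repeat_for_refresh(frames, source_fps=15, refresh_hz=30):
    repeats = refresh_hz // source_fps
    return [frame for frame in frames for _ in range(repeats)]

print(repeat_for_refresh(["f1", "f2", "f3"]))
# -> ['f1', 'f1', 'f2', 'f2', 'f3', 'f3']
```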
So to put it simply: when converting 29.97 fps NTSC video (or, in this case, a 30 fps digital TV camera source) to 15 fps web video, half of the information is lost and all motion becomes more jittery (or, as described in this video, more "aggressive").
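The conversion itself is just decimation, keep every other frame and throw the rest away (a sketch of the idea, not what any particular encoder literally does internally):

```python
# Dropping from ~30 fps to 15 fps keeps every other frame and discards
# half of the motion information outright.
def decimate(frames, factor=2):
    return frames[::factor]

print(decimate(list(range(10))))  # -> [0, 2, 4, 6, 8]
```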