Monday, November 27, 2017

Formant and resonance in Western and Kunqu opera singing, literature

Formant alignment depends on the pitch, the volume, the vowel, and the shape we make while singing it. Vowel shapes admit a multitude of possibilities, and very small differences can make the sound “maximally efficient” or not quite “good enough”. The jaw, the tongue, the lips, the back of the mouth (velo-pharyngeal port), the height of the back of the tongue, the height of the larynx, the open/closed quotient, and the depth of the vocal folds during vibration all play a part in the overall sound we hear when someone sings. So do the “at rest” length of the folds, the size of the larynx, the diameter and length of the vocal tract (the throat and mouth cavities), and the bones of the head and face. And “resonance” as a destination isn’t needed in anything but classical repertoire and some kinds of music that might be performed acoustically.

CONCLUSIONS: Formant tuning may be applied by a singer of the OM (old man) role, and both CF (color face) and OM role singers may use a rather pressed type of phonation, CF singers more than OM singers in the lower part of the pitch range. Most singers increased glottal adduction with rising F0.
Measured the long-term average spectrum (LTAS) of 10 Kunqu singers across 5 role types. No singer's formant was found, but CF (color face) singers showed a speaker's formant near 3 kHz. The LTAS differs substantially both from ordinary speech and from Western bel canto singing.

Saturday, November 25, 2017

Optimizing DTW-based audio-to-MIDI alignment and matching, Colin Raffel paper

This paper introduced a method of optimizing various DTW parameters on a synthetic MIDI dataset. He optimized the mean absolute alignment error by Bayesian optimization, and the confidence score by exhaustive search.

Some interesting points in the paper:
(1) The best alignment systems don't use beat-synchronous features.

(2) He introduced two penalties: the first penalizes "non-diagonal moves", the second ensures the entire subsequence is used when doing subsequence alignment. The best systems use median distance values for both penalties.
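The non-diagonal-move penalty can be sketched as an additive cost in the DTW recurrence. This is a hypothetical simplification, not the paper's exact parametrization (which tunes the penalty values themselves); `dtw_with_penalty` and `phi` are names introduced here for illustration.

```python
import numpy as np

def dtw_with_penalty(D, phi):
    """Minimal DTW cost over a pairwise distance matrix D.

    Non-diagonal moves incur an additive penalty phi, which discourages
    paths that stray from the diagonal.
    """
    n, m = D.shape
    C = np.full((n + 1, m + 1), np.inf)
    C[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            C[i, j] = D[i - 1, j - 1] + min(
                C[i - 1, j - 1],      # diagonal move: no penalty
                C[i - 1, j] + phi,    # vertical move: penalized
                C[i, j - 1] + phi,    # horizontal move: penalized
            )
    return C[n, m]
```

With `phi = 0` the recurrence reduces to plain DTW; raising `phi` makes off-diagonal warping progressively more expensive.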

(3) The synthetic MIDI corruption method includes changing the tempo, cropping out a MIDI segment, deleting the vocal track, changing instrument timbre, and changing note velocities. All of it is done with pretty_midi.
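A pretty_midi-free sketch of some of those corruption steps, operating on notes represented as plain `(start, end, pitch, velocity)` tuples. The function name and arguments are hypothetical; track deletion and timbre changes need track/program metadata and are omitted here.

```python
def corrupt(notes, tempo_factor, crop, velocity_scale):
    """Apply crop, tempo, and velocity corruptions to a list of notes.

    notes:          list of (start, end, pitch, velocity) tuples
    tempo_factor:   uniform time-stretch factor (tempo change)
    crop:           (start_t, end_t) window; notes outside are dropped
    velocity_scale: multiplier for note velocity, clipped to MIDI range
    """
    start_t, end_t = crop
    out = []
    for start, end, pitch, velocity in notes:
        # crop: keep only notes overlapping the chosen segment
        if end < start_t or start > end_t:
            continue
        # tempo change: uniformly stretch note times
        start, end = start * tempo_factor, end * tempo_factor
        # velocity change: scale, clipped to the valid MIDI range 1..127
        velocity = max(1, min(127, int(velocity * velocity_scale)))
        out.append((start, end, pitch, velocity))
    return out
```

In the paper the same kinds of edits are applied to real MIDI objects via pretty_midi rather than bare tuples.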

(4) He evaluated the matching confidence score by calculating the Kendall rank correlation between the score and the absolute alignment error, which means the error serves as the ground-truth matching confidence.

(5) All of the systems achieved the highest correlation when including the penalties in the score calculation, normalizing by the path length, and normalizing by the mean distance across the aligned portions.
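Point (5) can be illustrated with a hypothetical normalization, assuming the raw score is the summed distance along the alignment path:

```python
def confidence_score(path_distances, penalty_total, mean_distance):
    """Normalized alignment score as described in point (5).

    path_distances: per-step distances along the alignment path
    penalty_total:  accumulated move penalties (included in the score)
    mean_distance:  mean distance across the aligned portions
    """
    total = sum(path_distances) + penalty_total
    # normalize by path length, then by the mean aligned distance
    return total / (len(path_distances) * mean_distance)
```

Lower values indicate a better (more confident) alignment under this convention.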

Friday, November 24, 2017

Deep learning, where are you going?

A talk by Kyunghyun Cho, a professor at New York University, titled "Deep learning, where are you going?" Things to take away for me:

(1) Currently, most people use neural networks for one specific task: they grab the data and annotations, build an architecture, and train the model. However, as time goes on, the trained model becomes isolated because new information comes around, so we have to retrain it with newly collected data. How could we benefit from pre-trained models instead? One idea is to combine different pre-trained models to do a more complex task, or to use another neural net to interpret a pre-trained model.

(2) The idea of multilingual translation is to train a shared continuous language space (word- or character-level). He found the char2char model is better than the word2word or word2char models. Additionally, you can do mixed-language translation, where the input sentence mixes languages such as English and French.